MODEL OF KNOWLEDGE GRAPH FOR A COLLECTION OF MATHEMATICAL ARTICLES
Kazan Federal University (RUSSIAN FEDERATION)
About this paper:
Conference name: 16th annual International Conference of Education, Research and Innovation
Dates: 13-15 November, 2023
Location: Seville, Spain
Abstract:
This paper describes in detail the process of creating a knowledge graph for a collection of mathematical articles in Russian. During the creation of this graph, tools were developed for the automatic construction of knowledge graphs for collections that conform to the patterns present in the collection under study. Such a graph enables the systematic organization and structuring of the information stored in the articles of the collection. The knowledge graph facilitates establishing connections between articles based on their content, which allows hidden relationships to be identified and supports deeper domain analysis. In the future, this graph can serve as a foundation for intelligent systems, automatic classifiers, similar-article search engines, and other applications for researchers, journals, and students. The collection is planned to be expanded with studies in other languages, which will make it possible to build relationships between articles in different languages.
A special ontology for representing mathematical articles was developed. This ontology is the minimal one needed to build the required knowledge graph. The developed ontology was aligned with known external ontologies. The classes of the constructed ontology represent the types of objects in the knowledge graph, and the properties represent the links between these objects.
The input to the developed tools consists of a collection of mathematical article files in LaTeX format. These articles are opened in the appropriate encoding. Then, the necessary entities are extracted from the LaTeX code, including Universal Decimal Classification (UDC) codes, publication dates, authors, references, titles, the formulas used, and author affiliations. Standard text preprocessing is performed on the article texts. Next, mathematical terms are extracted from the lemmatized texts using the OntoMathPRO ontology, and document topics are extracted from the preprocessed texts (lemmatized texts without stop words) using the Latent Dirichlet Allocation (LDA) method.
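The extraction step described above can be illustrated with a minimal sketch. It is not the authors' tooling: the macro names (`\udk`, `\title`, `\author`), the candidate encodings, and all function names are assumptions chosen for illustration, and real collections may use different markup.

```python
import re

# Hypothetical macro names; the actual collection may use different ones.
FIELD_PATTERNS = {
    "udc": re.compile(r"\\udk\{([^}]*)\}"),
    "title": re.compile(r"\\title\{([^}]*)\}"),
    "authors": re.compile(r"\\author\{([^}]*)\}"),
}
# Inline formulas delimited by single dollar signs (display math is ignored here).
FORMULA_PATTERN = re.compile(r"\$([^$]+)\$")
# Citation keys of bibliography entries.
BIBITEM_PATTERN = re.compile(r"\\bibitem\{([^}]*)\}")


def decode_latex(raw, encodings=("utf-8", "cp1251")):
    """Decode raw file bytes, trying each assumed encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the file")


def extract_entities(source):
    """Collect simple metadata fields, inline formulas, and reference keys."""
    entities = {
        name: [match.strip() for match in pattern.findall(source)]
        for name, pattern in FIELD_PATTERNS.items()
    }
    entities["formulas"] = [m.strip() for m in FORMULA_PATTERN.findall(source)]
    entities["reference_keys"] = BIBITEM_PATTERN.findall(source)
    return entities


sample = r"""
\udk{517.9}
\title{On a boundary value problem}
\author{A. Ivanov}
Let $x^2 + y^2 = 1$ be the unit circle.
\begin{thebibliography}{9}
\bibitem{petrov2020} Petrov B. Some paper.
\end{thebibliography}
"""
result = extract_entities(sample)
```

Regular expressions of this kind only handle flat macro arguments without nested braces; a full LaTeX parser would be needed for more complex markup.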
To extract mathematical terms from the articles, mathematical concepts present in the OntoMathPRO ontology are identified. For this purpose, the labels, i.e., the names of concepts (classes) in Russian, are selected from the ontology. Both the original versions of the labels and their lemmatized versions are retained. Next, for each lemmatized text, it is checked whether a particular lemmatized label exists in it. If it is present, the original version of the corresponding label is added to the list of terms for the document.
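The matching procedure above can be sketched as follows. This is an illustrative sketch only: the dictionary-based `lemmatize` function stands in for a real morphological analyzer, the English labels stand in for the Russian OntoMathPRO labels, and matching on whole tokens (rather than raw substrings) is an added assumption.

```python
def lemmatize(text, lemma_map):
    """Toy lemmatizer: lowercase, split on whitespace, map tokens via a dict.

    A real pipeline would use a morphological analyzer for Russian instead.
    """
    return " ".join(lemma_map.get(token, token) for token in text.lower().split())


def build_label_index(labels, lemma_map):
    """Keep each original label together with its lemmatized version."""
    return {label: lemmatize(label, lemma_map) for label in labels}


def extract_terms(lemmatized_text, label_index):
    """Return original labels whose lemmatized form occurs in the text."""
    # Padding with spaces restricts matches to whole tokens (an assumption).
    padded = f" {lemmatized_text} "
    return [
        original
        for original, lemmatized_label in label_index.items()
        if f" {lemmatized_label} " in padded
    ]


# Hypothetical data for illustration only.
lemma_map = {"groups": "group", "rings": "ring", "spaces": "space", "are": "be"}
labels = ["Abelian groups", "Banach space", "Ring"]

label_index = build_label_index(labels, lemma_map)
text = lemmatize("Abelian groups and rings are studied", lemma_map)
terms = extract_terms(text, label_index)
```

Here `terms` holds the original (non-lemmatized) labels, mirroring the step in which the original version of a matched label is added to the document's term list.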
Topic modeling was performed on the collection of mathematical articles using the LDA method. Optimal hyperparameters, including the number of topics, were selected by grid search to maximize the C_V coherence metric. The identified topics were recorded in the knowledge graph through specific properties.
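The hyperparameter selection can be sketched generically. The grid search below is a plain exhaustive search; `toy_coherence` is a placeholder and not a real coherence computation, and the parameter names `num_topics` and `alpha` as well as the grid values are assumptions. In an actual pipeline, the scoring function would train an LDA model for the given parameters and return its C_V coherence.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the highest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_coherence(num_topics, alpha):
    """Placeholder score that peaks at num_topics=20, alpha=0.1.

    Stands in for: train LDA with these parameters, then compute coherence.
    """
    return -abs(num_topics - 20) / 20 - abs(alpha - 0.1)


# Hypothetical grid for illustration only.
param_grid = {"num_topics": [10, 20, 30], "alpha": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, toy_coherence)
```

Exhaustive search scales with the product of the grid sizes, so the grid is usually kept small when each evaluation requires training a topic model.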
Also, in this study various statistics of the constructed knowledge graph were computed.
Keywords:
Knowledge graph construction, Linked Data, Topic modeling, Mathematical Paper.