About this paper

Pages: 9014-9022
Publication year: 2017
ISBN: 978-84-617-8491-2
ISSN: 2340-1079
doi: 10.21125/inted.2017.2131

Conference name: 11th International Technology, Education and Development Conference
Dates: 6-8 March, 2017
Location: Valencia, Spain


A. Galieva (1), Z. Vavilova (2), V. Gafarova (3)

(1) Research Institute of Applied Semiotics of Tatarstan Academy of Sciences (RUSSIAN FEDERATION)
(2) Kazan State Power Engineering University (RUSSIAN FEDERATION)
(3) Kazan Federal University (RUSSIAN FEDERATION)
The advantages of using corpus data in education and research are well documented in the specialized literature. Corpora considerably simplify the acquisition and processing of linguistic data.

Two main corpora have been built for the Tatar language so far, both in open access: the Corpus of Written Tatar, compiled at Kazan Federal University, Russia, and the Tatar National Corpus (“Tugan Tel”, TT), developed by researchers of the Institute of Applied Semiotics of Tatarstan Academy of Sciences, Russia. Both corpora are replenished hourly, mainly with media texts, which provides a constant flow of fresh linguistic material.

The paper explores the potential of the “Tugan Tel” Corpus and its significance for Tatar lexicography. As of November 2016, its text collection comprised more than 100 million word usages. The Corpus includes texts of various genres, from fiction, media texts and official documents to textbooks and scientific publications. Each document is supplied with a metadata description.

Texts included in the Corpus are supplied with morphological annotation, i.e. information about the part of speech of the word stem and the set of its grammatical features. Annotation is carried out automatically by a two-level morphological analysis module built with the PC-KIMMO tool. The system of morphological tags in the Corpus is based on the Leipzig Glossing Rules developed by the Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology together with the Department of Linguistics of the University of Leipzig. In addition, the developers of the “Tugan Tel” Corpus introduced special tags for morphological categories that are specific to Turkic languages.

The search system of the Corpus supports search by lemma (lexeme), by word form, and by a set of morphological parameters specified by the user. It also supports search for stop words, search by part of a word, and search based on logical expressions. The user can thus build sophisticated queries, for instance to retrieve various types of collocations.
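The kinds of query listed above can be illustrated with a minimal Python sketch. It is not the Corpus's actual search engine or query syntax: the `Token` record, the helper names and the five-token sample are all hypothetical, and the glosses (PL, LOC, PRS) are simplified Leipzig-style labels chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated corpus token (hypothetical record layout)."""
    form: str        # surface word form
    lemma: str       # dictionary form
    pos: str         # part of speech of the stem
    tags: frozenset  # grammatical tags, Leipzig-style labels


# Toy sample invented for illustration; not taken from the Corpus.
TOKENS = [
    Token("балалар", "бала", "N", frozenset({"PL"})),
    Token("мәктәптә", "мәктәп", "N", frozenset({"LOC"})),
    Token("китаплар", "китап", "N", frozenset({"PL"})),
    Token("укый", "укы", "V", frozenset({"PRS"})),
    Token("китапларда", "китап", "N", frozenset({"PL", "LOC"})),
]


# Elementary predicates: each returns a function Token -> bool.
def by_lemma(lemma):
    return lambda t: t.lemma == lemma


def by_form(form):
    return lambda t: t.form == form


def by_part(part):
    return lambda t: part in t.form


def has_tags(*tags):
    return lambda t: set(tags) <= t.tags


# Logical combinators for building compound queries.
def all_of(*preds):
    return lambda t: all(p(t) for p in preds)


def any_of(*preds):
    return lambda t: any(p(t) for p in preds)


def negate(pred):
    return lambda t: not pred(t)


def search(tokens, pred):
    """Return the word forms of all tokens that satisfy the predicate."""
    return [t.form for t in tokens if pred(t)]
```

With this sample, `search(TOKENS, by_lemma("китап"))` returns `["китаплар", "китапларда"]`, while `search(TOKENS, all_of(has_tags("PL"), negate(has_tags("LOC"))))` returns the plural forms that are not locative, `["балалар", "китаплар"]`.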

The Corpus text collection contains a considerable body of linguistic data against which linguists can empirically test their hypotheses and the rules they formulate. The system returns a substantial volume of empirical data, processed according to the user's requirements, in a matter of seconds.

The Corpus provides contexts which allow the researcher to examine the senses of lexemes in detail:
- to specify to what extent the word definition given in a dictionary is complete and accurate;
- to check if the word senses provided by the dictionary are correct;
- to identify new words and word senses which are not recorded in the dictionary;
- to identify free and bound word senses;
- to identify typical environments where the word can be encountered, etc.
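Examining a word in its contexts is commonly done with a keyword-in-context (KWIC) concordance. The sketch below is a generic Python illustration of that technique, not the Corpus's own interface: the function name, the window size and the toy token sequence are assumptions, and matching is by exact word form only, whereas a real corpus query would typically match by lemma.

```python
def kwic(tokens, keyword, window=2):
    """Collect (left context, keyword, right context) for every hit.

    `tokens` is a list of word forms; `window` is the number of
    neighbouring tokens kept on each side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Toy token sequence invented for illustration; not taken from the Corpus.
TEXT = "балалар мәктәптә китап укый һәм китап яза".split()

for left, key, right in kwic(TEXT, "китап"):
    print(f"{left:>20} [{key}] {right}")
```

Lining up the hits in this way lets the lexicographer compare the neighbours of each occurrence and judge which dictionary sense, if any, it belongs to.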

The Corpus makes it possible to obtain reliable data on the distribution of grammatical categories in the Tatar language:
- on morpheme frequency;
- on frequency of affixal chains;
- on frequency of the given word form in certain collocations, etc.
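Frequency counts of this kind can be sketched in a few lines of Python. The example assumes that word forms are already segmented into stem and affixes with hyphens; that input format, the function names and the sample data are hypothetical and do not reflect the Corpus's internal representation. The counts are over surface affixes, so allomorphs of one morpheme (for example the locative variants -да and -тә) are counted separately unless they are first mapped to a common tag.

```python
from collections import Counter

# Toy segmented word forms (stem-affix-affix), invented for illustration.
SEGMENTED = ["бала-лар", "мәктәп-тә", "китап-лар", "китап-лар-да", "китап"]


def affix_chain(segmented_form):
    """Return the affixes following the stem as a tuple, in order."""
    _stem, *affixes = segmented_form.split("-")
    return tuple(affixes)


def affix_frequency(forms):
    """Count how often each individual surface affix occurs."""
    return Counter(affix for form in forms for affix in affix_chain(form))


def chain_frequency(forms):
    """Count how often each complete affix chain occurs (bare stems skipped)."""
    chains = (affix_chain(form) for form in forms)
    return Counter(chain for chain in chains if chain)
```

On the sample above, `affix_frequency(SEGMENTED)` counts -лар three times and -тә and -да once each, while `chain_frequency(SEGMENTED)` counts the chain -лар twice and the chains -тә and -лар-да once each.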

The size of the Corpus makes the data representative and supports full coverage of the whole range of linguistic phenomena, which is crucial in compiling dictionaries.
