A CORPUS-BASED STATISTICAL ANALYSIS OF THE DISTINCTIVE LEXICON OF ITALIAN WRITTEN AND SPOKEN ACADEMIC DISCOURSE THROUGH TWO PARAMETERS: FREQUENCY AND KEYNESS
The present research is grounded in the actual learning and linguistic needs observed among non-native students, who must become increasingly aware of how to act and how to communicate in everyday academic situations.
Our aim is to map and disseminate, through computational methods based on the use of corpora and statistical tools, the Italian spoken and written academic lexicon. This lexicon is tightly linked to the learning and didactic activities with which students come into contact daily during lessons, exams, conferences and so on. The correct and immediate decoding of lexical units such as contesto (context), teoria (theory) and approccio (approach), or of multi-word expressions such as introdurre un concetto (to introduce a concept) or trattare un argomento (to deal with a topic), is a fundamental step towards a successful academic career and, later, entry into the working world. This concerns first of all L2 students, who have to come to terms with a different social and linguistic context. Immediate access to the meaning of a word allows students to focus on the content of what they are trying to explain, rather than on how to explain it.
Both the frequency with which certain words are used in a text and their salience in defining its meaning can provide important cues about the text and its author, since lexical choices are never fortuitous (Archer 2009).
After building corpora of Italian spoken and written academic discourse, our aim has been to compile two frequency lists of the non-technical Italian words widely used in academic written and oral communication. The corpora comprise over one million words, spanning different subject areas, textual typologies and communicative situations. The lexical units extracted from them are then ranked by frequency and filtered by a statistical measure of dispersion across the above-mentioned areas.
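The ranking-and-filtering step described above can be sketched in a few lines of code. The text does not specify which dispersion measure is used; a common choice in lexicography is Juilland's D, which is what the hypothetical sketch below assumes. The toy subcorpora standing in for subject areas are invented for illustration only.

```python
import math
from collections import Counter

def juilland_d(freqs_per_part):
    """Juilland's D dispersion: close to 1 = evenly spread across parts,
    close to 0 = concentrated in few parts."""
    n = len(freqs_per_part)
    mean = sum(freqs_per_part) / n
    if mean == 0:
        return 0.0
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs_per_part) / n)
    v = sd / mean  # coefficient of variation
    return max(0.0, 1.0 - v / math.sqrt(n - 1))

def frequency_list(subcorpora, min_d=0.5):
    """Rank words by total frequency, keeping only well-dispersed ones."""
    counters = [Counter(tokens) for tokens in subcorpora]
    sizes = [sum(c.values()) for c in counters]
    vocab = set().union(*counters)
    rows = []
    for w in vocab:
        # relative frequencies per subcorpus (normalise for size differences)
        rel = [c[w] / s for c, s in zip(counters, sizes)]
        d = juilland_d(rel)
        if d >= min_d:
            rows.append((w, sum(c[w] for c in counters), d))
    rows.sort(key=lambda r: -r[1])
    return rows

# toy subcorpora standing in for the different subject areas
areas = [
    "la teoria del contesto e un concetto chiave".split(),
    "introdurre un concetto richiede un approccio e una teoria".split(),
    "il contesto e la teoria di un argomento".split(),
]
for word, freq, d in frequency_list(areas, min_d=0.4)[:5]:
    print(word, freq, round(d, 2))
```

Filtering by dispersion in this way prevents a word that is very frequent in only one subject area from dominating a list meant to capture the general, cross-disciplinary academic lexicon.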
The frequency of the lexical occurrences that make up a corpus is a key element in all tasks concerning the recognition and comprehension of words, whether spoken or written. This factor influences reading, writing and productive skills, as well as the processes of language acquisition and development, especially when dealing with a particular, non-generic genre such as the academic lexicon.
The lists respond directly to the need to expand the academic lexicon of non-native students who learn Italian as a second language at university. They can be used, for example, to develop teaching materials, to assess the proficiency of these students, as a starting point for Natural Language Processing applications, or to train and reinforce this lexical domain within an online learning environment.
With the aim of extracting relevant contextual cues, we have added to the above-mentioned frequency parameter the index of keyness, which makes it possible to identify the keywords of a text, that is, the words that best typify it. This project aims to evaluate the validity of these statistical measures, reflecting on their ability to interpret and describe the linguistic context in which textual data are embedded, such as the academic one.
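The keyness index compares a word's frequency in a target corpus against a reference corpus. The text does not say which keyness statistic is adopted; the log-likelihood ratio (Dunning 1993) is a widespread choice, and the sketch below assumes it. The two toy word lists are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word occurring a times in a target
    corpus of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, top=5):
    """Return the words most over-represented in the target corpus."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    ct, cr = sum(t.values()), sum(r.values())
    scored = []
    for w in t:
        # keep only words relatively more frequent in the target corpus
        if t[w] / ct > r.get(w, 0) / cr:
            scored.append((w, log_likelihood(t[w], r.get(w, 0), ct, cr)))
    scored.sort(key=lambda x: -x[1])
    return scored[:top]

# toy academic target vs. everyday-language reference
target = "la teoria e il contesto di un approccio teorico alla teoria".split()
reference = "il cane corre nel parco e il gatto dorme sul divano".split()
print(keywords(target, reference))
```

A usage pattern consistent with the project's design would be to take the academic corpora as the target and a general-language reference corpus as the baseline, so that the highest-scoring words are exactly those that typify academic discourse.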