K. Kotani1, T. Yoshimi2, M. Uchida1

1Kansai Gaidai University (JAPAN)
2Ryukoku University (JAPAN)
Automatically classifying texts written by learners of English as a foreign language (EFL learners) into several levels can lighten the burden on language teachers by reducing the time and effort needed to evaluate the texts. Because texts classified into the lower levels are likely to contain more errors, language teachers can use these texts to teach intensively, through error correction, the linguistic knowledge that EFL learners need.

For this reason, various methods for automatically classifying learner texts have been proposed. Previous methods judged the appropriateness of a text by quantitatively examining linguistic features, such as the distribution of adjective, adverb, noun, and verb phrases, because these features are known to differ between appropriate and inappropriate texts.

Most previous studies neglected learner features, which describe how an EFL learner wrote a text: the time spent writing it and the learner's confidence in its appropriateness. However, these learner features are regarded as significant evidence of writing proficiency (Izumi et al. 2005). Writing time reflects how difficult a text is for the learner, because more difficult texts take longer to write. Similarly, confidence reflects difficulty, because learners are less confident when writing more difficult texts.

Against this background, this study develops a text classification method based on both linguistic features and learner features. The method is constructed using discriminant analysis. The explanatory variables are linguistic and learner features taken from the learner corpus of Kotani et al. (2011), in which 90 EFL learners wrote texts describing a series of pictures (Hughes 2003).

The learner corpus contains the texts written by the EFL learners, the time spent writing each text, and the learners' confidence in the appropriateness of each text. The objective variable of the analysis is an integrated score of the adequacy and fluency of each text, as evaluated by a language teacher.
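As a rough illustration of the kind of analysis described above, the following Python sketch fits a two-group linear discriminant on one linguistic feature and one learner feature. The feature names (noun-phrase ratio, writing time in minutes), the toy values, the equal-prior decision threshold, and the helper names `fit_discriminant` and `predict` are all assumptions made for illustration; they are not taken from the corpus or from the exact procedure used in this study.

```python
from statistics import mean

# Invented toy data: (noun-phrase ratio, writing time in minutes).
# Label 0 = lower-level text, label 1 = higher-level text (hypothetical).
X = [(0.30, 12.0), (0.35, 11.0), (0.28, 13.5), (0.33, 12.5),
     (0.45, 7.0), (0.50, 6.5), (0.48, 8.0), (0.52, 7.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_discriminant(X, y):
    """Fit a two-group linear discriminant with a pooled 2x2 covariance."""
    groups = {c: [x for x, label in zip(X, y) if label == c] for c in (0, 1)}
    means = {c: [mean(col) for col in zip(*rows)] for c, rows in groups.items()}

    # Pooled within-group covariance matrix.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for c, rows in groups.items():
        for x in rows:
            d = [x[0] - means[c][0], x[1] - means[c][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(X) - 2
    s = [[v / dof for v in row] for row in s]

    # Closed-form inverse of the 2x2 covariance matrix.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]

    # Discriminant direction w = S^-1 (mean_1 - mean_0).
    diff = [means[1][i] - means[0][i] for i in range(2)]
    w = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]

    # Equal priors assumed: the boundary passes through the midpoint of the means.
    mid = [(means[0][i] + means[1][i]) / 2 for i in range(2)]
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def predict(model, x):
    """Return 1 if the discriminant score is positive, otherwise 0."""
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


model = fit_discriminant(X, y)
print([predict(model, x) for x in X])
```

With more than two features or groups, the same idea applies, but the covariance matrix would normally be inverted with a numerical library rather than by hand.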

In the experiments, the classification method achieved accuracy higher than random chance in every setting: 67.8% for binary classification, 43.3% for five-group classification, and 20.0% for ten-group classification. Accuracy was measured by leave-one-out cross-validation, that is, k-fold cross-validation with k equal to the number of samples.
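The evaluation procedure can be sketched as follows. This minimal Python example runs leave-one-out cross-validation and compares the reported accuracies with a chance baseline. It makes several assumptions that are not stated in the abstract: the chance level is taken as 1/k, which holds only if the k groups are equally sized; a nearest-centroid classifier stands in for the discriminant analysis to keep the example short; and the toy data and function names are invented for illustration.

```python
# Invented toy data: (noun-phrase ratio, writing time in minutes), label 0 or 1.
X = [(0.30, 12.0), (0.35, 11.0), (0.28, 13.5), (0.33, 12.5),
     (0.45, 7.0), (0.50, 6.5), (0.48, 8.0), (0.52, 7.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each group."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the group whose centroid is closest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loo_accuracy(X, y):
    """Leave-one-out: train on all samples but one, test on the held-out one."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit_centroids(train_X, train_y)
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


def chance_level(n_groups):
    """Expected accuracy of random guessing, assuming equal-sized groups."""
    return 1 / n_groups


print("toy leave-one-out accuracy:", loo_accuracy(X, y))

# Accuracies reported in the abstract, keyed by number of groups.
reported = {2: 0.678, 5: 0.433, 10: 0.200}
for k, acc in reported.items():
    print(k, "groups: reported", acc, "vs. chance", chance_level(k))
```

If the groups are not equally sized, a majority-class baseline would be a fairer reference point than 1/k.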

Hence, the classification method can help language teachers find inappropriate texts, which contain the linguistic elements that EFL learners still need to learn.

Hughes, A. Testing for Language Teachers. 2nd Edition. Cambridge University Press, 2003.

Kotani, K., Yoshimi, T., Nanjo, H. and Isahara, H. Compiling Learner Corpus Data of Linguistic Output and Language Processing in Speaking, Listening, Writing, and Reading. Proceedings of the 5th International Joint Conference on Natural Language Processing, pp.1418-1422, 2011.

Izumi, E., Uchimoto, K. and Isahara, H. Error Annotation for Corpus of Japanese Learner English. Proceedings of 6th International Workshop on Linguistically Annotated Corpora, pp.71-80, 2005.