K. Kotani1, T. Yoshimi2, T. Kutsumi3, I. Sata3, H. Isahara4

1Kansai Gaidai University (JAPAN)
2Ryukoku University (JAPAN)
3Sharp Corporation (JAPAN)
4National Institute of Information and Communications Technology (JAPAN)
Supporting the evaluation of language proficiency is one advantage of computer-assisted language learning. Because foreign language learners vary greatly in proficiency, it is important for a teacher to identify the problems each learner has. In a reading class, a teacher can evaluate reading proficiency with the comprehension questions in a reading textbook. Authentic texts, however, come with no such questions, so a teacher who uses them as reading materials has to prepare the questions. Evaluating learners’ comprehension therefore places a heavy burden on teachers.

Natural language processing technology can assist teachers in evaluating learners’ reading proficiency, because it provides an evaluation method that does not rely solely on comprehension questions. We propose a Reading Proficiency Model (RPM) that estimates a learner’s reading proficiency as a score on the Test of English for International Communication (TOEIC). We constructed RPM by regression, taking the TOEIC score as the dependent variable and the linguistic properties of a text and the learner’s reading time as the independent variables.

Linguistic properties refer to the text complexity arising from the lexical, syntactic and discourse properties of a text. Lexical difficulty is measured with a morpho-lexical analyzer[4]. Syntactic complexity is derived with a syntactic parser[3], which produces a syntactic tree of an input text. Discourse complexity is defined as the number of anaphoric expressions. Reading time data were collected from 64 learners of English as a foreign language (EFL) who reported their TOEIC scores. Each learner read 7 or 14 texts selected from a TOEIC textbook. As a result, 451 instances of text reading time data were obtained.
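As a rough illustration of how one training instance might be assembled from these properties, the following Python sketch counts anaphoric expressions and combines the count with the other inputs. It is only a sketch under stated assumptions: the pronoun list, the function names, and the layout of the feature vector are hypothetical, and the lexical and syntactic values are passed in as plain numbers because they come from the external tools cited as [4] and [3].

```python
import re

# Hypothetical, deliberately small list of anaphoric expressions.
# The abstract does not say how anaphora were identified, so this
# pronoun list is an assumption made purely for illustration.
ANAPHORS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "its", "their"}


def discourse_complexity(text: str) -> int:
    """Count the tokens of `text` that appear in the illustrative list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for tok in tokens if tok in ANAPHORS)


def make_instance(text: str,
                  lexical_difficulty: float,
                  syntactic_complexity: float,
                  reading_time_sec: float,
                  toeic_score: int) -> dict:
    """Bundle the independent variables and the dependent variable.

    lexical_difficulty and syntactic_complexity are placeholders for
    values produced by external tools; they are not computed here.
    """
    features = [
        lexical_difficulty,
        syntactic_complexity,
        float(discourse_complexity(text)),
        reading_time_sec,
    ]
    return {"features": features, "target": toeic_score}


sample = "The manager sent the report. She revised it before they met."
instance = make_instance(sample, lexical_difficulty=2.0,
                         syntactic_complexity=5.0,
                         reading_time_sec=12.5, toeic_score=600)
print(instance["features"])
```

All numeric inputs in the usage lines are made-up values, not data from the experiment.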

RPM was constructed with 361 instances as training data for a regression by Support Vector Machines (well-known machine learning algorithms with high generalization performance), and verified with the remaining 90 instances. RPM achieved an error rate of 17.5% in our experiment. We further examined our model by comparing it with two other RPMs (N-model and S-model) that employed linguistic features proposed by previous studies[1], [2]. N-model was developed based on lexical items in particular constructions such as a relative clause[1]. S-model was constructed based on syntactic features and lexical features such as the height of a syntactic tree and the number of conjunctions[2]. The error rates of these models were 18.7% for N-model and 18.4% for S-model. In terms of error reduction rate, the error rate of our model is lower than that of N-model by 6.4% (=(18.7-17.5)/18.7*100) and that of S-model by 4.9% (=(18.4-17.5)/18.4*100).
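The error reduction rate used in this comparison is the baseline error minus our error, divided by the baseline error, times 100. A short Python check of that arithmetic, using only the figures reported above, might look as follows; the function and variable names are illustrative and not part of the original work.

```python
def error_reduction_rate(baseline_error: float, model_error: float) -> float:
    """Relative reduction (in percent) of model_error against baseline_error."""
    return (baseline_error - model_error) / baseline_error * 100


# Error rates (in percent) reported in the experiment.
rpm_error = 17.5
n_model_error = 18.7
s_model_error = 18.4

vs_n = error_reduction_rate(n_model_error, rpm_error)
vs_s = error_reduction_rate(s_model_error, rpm_error)

print(f"vs N-model: {vs_n:.1f}%")  # prints "vs N-model: 6.4%"
print(f"vs S-model: {vs_s:.1f}%")  # prints "vs S-model: 4.9%"

# Training and verification instances together cover the whole data set.
assert 361 + 90 == 451
```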

From these experimental results, we conclude that our RPM can help teachers evaluate EFL learners’ reading proficiency.

[1] Nagata, R. et al. 2002. A method of rating English reading skill automatically: Rating English reading skill using reading speed. Computer & Education, Vol. 12. 99-103.
[2] Schwarm, S. E. et al. 2005. Reading level assessment using support vector machines and statistical language models. Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics. 523-530.
[3] Sekine, S. et al. 1995. A corpus-based probabilistic grammar with only two non-terminals. Proc. of the 4th International Workshop on Parsing Technologies. 216-223.
[4] Someya, Y. 2000. Word Level Checker: Vocabulary Profiling Program by AWK, Ver. 1.5.