CLASSIFICATION OF LANGUAGE LEARNERS’ SENTENCES INTO NATIVE SPEAKER-LIKE OR NON-NATIVE SPEAKER-LIKE SENTENCES USING LEARNER SENTENCES AND MACHINE TRANSLATION SENTENCES AS TRAINING DATA
Teacher feedback supports second language (L2) learning: L2 learners who receive error feedback have been reported to improve in written accuracy (Ferris 2006). Given the importance of this feedback, a language teacher needs to identify the errors that occur in L2 learners’ sentences. Because automatic evaluation reduces the teacher’s burden of checking each learner’s composition, previous studies have proposed automatic evaluation systems that classify L2 learner sentences as either native speaker-like (fluent and adequate) or non-native speaker-like (unnatural) (Lee et al. 2007, Baroni and Bernardini 2006, Tomokiyo and Jones 2001, Kotani et al. 2008). Henceforth, we refer to such an automatic evaluation system simply as a classifier. These studies used statistical methods or machine learning algorithms (Quinlan 1992, Vapnik 1998) to classify L2 learner sentences.
Constructing a classifier with a machine learning algorithm in this way requires a large number of L2 learner sentences and native speaker sentences as training data. Although native speaker sentences have become easily available through the development of corpora such as the British National Corpus (http://www.natcorp.ox.ac.uk), L2 learner sentences remain hard to obtain. Lee et al. (2007) therefore proposed using machine translation (MT) sentences as alternative training data in place of L2 learner sentences. This proposal is attractive because MT sentences, like L2 learner sentences, should contain linguistic errors. Moreover, an MT system can produce a large number of MT sentences far more easily than L2 learner sentences can be collected.
In constructing a classifier for L2 learner sentences, it is important to choose classification features that reveal the differences between native-like and non-native-like sentences. In this paper, our classifier examines Japanese sentences written by L2 learners on the basis of the distribution of word-by-word translated expressions. These expressions can be identified with a word alignment technique. The distribution of word alignments helps to classify L2 learner sentences as native speaker-like or non-native speaker-like.
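As a rough illustration of how a word alignment distribution could be turned into classification features, the following minimal sketch summarizes an alignment as a few ratios. The function name `alignment_features`, the representation of an alignment as (source index, target index) pairs, and the three features themselves are assumptions made for this example; the text above does not specify the exact feature set.

```python
def alignment_features(n_source, n_target, links):
    """Summarize a word alignment as simple numeric features.

    n_source, n_target: number of words in the source and target sentences.
    links: iterable of (source_idx, target_idx) pairs, assumed to come from
           some external word aligner (hypothetical input format).
    """
    links = list(links)  # allow generators; we iterate twice below
    aligned_source = {s for s, _ in links}
    aligned_target = {t for _, t in links}
    return {
        # share of source words that have a word-by-word counterpart
        "aligned_source_ratio": len(aligned_source) / n_source if n_source else 0.0,
        # share of target words that have a word-by-word counterpart
        "aligned_target_ratio": len(aligned_target) / n_target if n_target else 0.0,
        # number of target words left without any alignment link
        "unaligned_target": n_target - len(aligned_target),
    }


# Toy example: 4 source words, 5 target words, 3 alignment links.
features = alignment_features(4, 5, [(0, 0), (1, 1), (2, 3)])
```

A feature dictionary of this kind could then be passed to any standard learner; which features actually separate native speaker-like from non-native speaker-like sentences is an empirical question.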
In this paper, we examined whether MT sentences can serve as alternative training data for L2 learner sentences. To this end, we constructed two automatic evaluation systems, one trained on L2 learner sentences and the other trained on MT sentences, and compared their validity. The experiment showed that MT sentences and L2 learner sentences differ in their adequacy as training data: the classifier trained on MT sentences achieved a classification accuracy of 80.4%, whereas the classifier trained on L2 learner sentences achieved 98.7%. The MT sentence-based classifier is thus adequate but less effective than the L2 learner sentence-based classifier.
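The comparison described above can be sketched in miniature as follows. This is only an illustration under strong assumptions: the classifier is a one-feature threshold rule (a decision stump) rather than the learning algorithm actually used, the feature is a hypothetical alignment ratio, label 1 is assumed to mean non-native speaker-like, and all data values are invented toy numbers, not the experimental data.

```python
def train_stump(samples):
    """Pick the threshold t that best separates label 1 (value >= t) from label 0.

    samples: list of (feature_value, label) pairs with label in {0, 1}.
    """
    values = sorted({v for v, _ in samples})
    # candidate thresholds: the smallest value and midpoints between neighbours
    candidates = [values[0]] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((v >= t) == bool(y) for v, y in samples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def accuracy(threshold, samples):
    """Fraction of samples whose label matches the threshold rule."""
    return sum((v >= threshold) == bool(y) for v, y in samples) / len(samples)


# Invented toy data: (alignment ratio, label); 0 = native-like, 1 = non-native-like.
native = [(0.3, 0), (0.4, 0)]
learner = [(0.7, 1), (0.8, 1)]
mt = [(0.9, 1), (0.95, 1)]
test = [(0.35, 0), (0.45, 0), (0.6, 1), (0.75, 1)]  # held-out learner-style data

# Train one classifier per training set, then evaluate both on the same test set.
acc_learner = accuracy(train_stump(native + learner), test)
acc_mt = accuracy(train_stump(native + mt), test)
```

In this toy setting the MT-trained threshold lands higher than the learner-trained one, so it misses a mildly non-native test sentence. This mirrors, in a purely schematic way, how a mismatch between MT errors and learner errors could lower accuracy.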