CAN LANGUAGE PROCESSING ALGORITHMS REPLACE MANUAL ASSESSMENT OF PRIMARY SCHOOL STUDENTS' SHORT ANSWERS? 
Higher School of Economics (RUSSIAN FEDERATION)
About this paper:
Conference name: 16th International Conference on Education and New Learning Technologies
Dates: 1-3 July, 2024
Location: Palma, Spain

Abstract:
Introduction:
Reading involves cognitive functions, language proficiency, and prior knowledge. Open-ended items assess understanding best, but they must be scored by human raters, which is time-consuming. Given time and budget constraints, can natural language processing (NLP) algorithms aid or replace human raters?
Method and Data:
Our research is based on the Progress test, which measures the reading literacy of 4th grade students. The test consists of three text fragments connected by a common plot, followed by questions. All questions in the test are closed-ended except for two; this study describes the scoring of one of them. The data included responses from about 4,500 students. We designed and implemented two models, using 75% of the data for training.
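As an illustration of the 75% training share mentioned above, the following is a minimal sketch of a shuffled split. The function name, the fixed seed, and the round figure of 4,500 responses are assumptions made for the example; the abstract does not say how the split was actually drawn.

```python
import random


def train_test_split_indices(n_items, train_fraction=0.75, seed=0):
    """Shuffle the indices 0..n_items-1 and split them into train and test lists."""
    indices = list(range(n_items))
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


# 4,500 is the approximate number of responses reported in the abstract.
train_idx, test_idx = train_test_split_indices(4500)
print(len(train_idx), len(test_idx))
```

With these assumed numbers, 3,375 responses would go to training and 1,125 would be held out for testing.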
Random Forest Classifier:
Random Forest Classifier is a machine learning method that constructs multiple decision trees to output class predictions (Breiman, 2001). It reduces overfitting and improves generalisation by using random subsets of training data and features. The final output is an aggregation of individual tree predictions, making it robust and accurate for various applications. The method's main advantage is its effectiveness in handling noisy data, which is crucial in our case, since responses from primary school students can be noisy. To assess primary school students' answers, we created a model using features such as the length of the answer and the time it took students to read the text.
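A model of this kind might be sketched with scikit-learn as follows. This is not the authors' implementation: the data are synthetic, the labelling rule is a toy stand-in for human scores, and the variable names, units, and hyperparameters are all assumptions. Only the two feature types (answer length and reading time) and the 75% training share come from the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_responses = 400  # small synthetic sample, not the real data set

# Hypothetical features: answer length (in words) and reading time (in seconds).
answer_length = rng.integers(1, 60, size=n_responses)
reading_time = rng.uniform(30, 300, size=n_responses)

# Toy labelling rule standing in for human scores (1 = correct, 0 = incorrect).
labels = ((answer_length > 15) & (reading_time > 60)).astype(int)

X = np.column_stack([answer_length, reading_time])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.75, random_state=0, stratify=labels
)

# Each tree is fitted on a bootstrap sample and considers a random subset
# of features at each split; predictions are aggregated across trees.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

In practice the feature set would likely be richer than these two columns; the abstract names them only as examples.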
LLM:
Large Language Models (LLMs) are neural networks designed for natural language processing. Well-known examples include ChatGPT, BERT, and Llama (Törnberg et al., 2023). These models are pre-trained on large corpora, which accounts for their recent success: starting from a pre-trained model speeds up training and fine-tuning. In our research, we used a Russian-language BERT model (Kuratov & Arkhipov, 2019) to assess primary school students' responses.
 
Results:
The weighted F1-score on the test sample was 0.924 for Random Forest and 0.95 for the neural network, and the proportion of false negative predictions was less than 1%. The agreement between the two models (Cohen's kappa) was 0.7 (Cohen, 1960). These results show that it is feasible to deploy automated scoring of short texts using a neural network model coupled with a Random Forest algorithm. Such an approach can substantially improve test assessment by speeding up scoring and lowering costs.
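For readers unfamiliar with the two metrics reported above, the sketch below computes a weighted F1-score and Cohen's kappa in plain Python. The label lists are invented toy data and the function names are illustrative; none of this reproduces the study's figures.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


def cohen_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(ratings_a)
    observed = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: human scores and the outputs of two hypothetical models.
human = [1, 1, 1, 0, 0, 1, 0, 1]
model_a = [1, 1, 0, 0, 0, 1, 0, 1]
model_b = [1, 1, 1, 0, 1, 1, 0, 1]

f1_a = weighted_f1(human, model_a)
kappa_ab = cohen_kappa(model_a, model_b)
```

The weighted F1 compares each model with the human scores, whereas kappa here compares the two models with each other, which matches how the two figures are described in the Results.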
References:
[1] Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
[2] Kuratov, Y., & Arkhipov, M. (2019). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv:1905.07213.
[3] Törnberg, P., Valeeva, D., Uitermark, J., & Bail, C. (2023). Simulating Social Media Using Large Language Models to Evaluate Alternative News Feed Algorithms. https://doi.org/10.48550/arXiv.2310.05984
[4] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Keywords:
 Reading literacy, primary school, neural networks, natural language processing, large language models, random forest.