T.H. Jen, K.M. Chen, C.D. Lee, H.H. Fu, C.Y. Chang

National Taiwan Normal University (TAIWAN)
The multiple-choice testing format has become prevalent in high-stakes examinations because of its ease and fairness of scoring, but whether all abilities can be assessed by fixed answers or closed-form solutions has been debated for years. Other testing formats, such as open-ended questions, hands-on tasks, and free-response essay questions, are thus adopted to comprehensively assess examinees’ higher-level thinking, scientific reasoning, or argumentation skills; however, the inter-rater reliability and time efficiency of scoring must be considered for these formats. Since well-defined scoring guidelines increase inter-rater reliability and advances in computer science and information technology reduce scoring time, an automated scoring system that integrates both ensures more reliable and time-saving scoring. Inter-rater reliability indicates the degree of agreement among raters on the scoring guideline. However, different raters weight certain information in a response differently, especially when the response does not fit any scoring criterion, which leads to different scores for the same response across raters. Such variability among raters is usually treated as error variance: the more raters’ scoring results are available, the smaller the error variance and the better the ability estimate obtained from their average. Lee, Jen, and Chang (Accepted) demonstrated this idea by comparing the reliability of four automated scoring systems. Three of them were trained separately on three human raters’ scores, and the fourth was trained on the average scores of the three human raters. Accordingly, the fourth system made the most reliable predictions.
In the current study, the same examinees as in Lee et al.’s study were randomly split into two groups. For the first group, scores were available from the three human raters and their average, from the four automated scoring systems, and from the IRT scores on other test items. For the second group, the scores from the four automated scoring systems and the IRT scores were retained, whereas the scores from the three human raters and their average were treated as missing data. Using the missing value analysis module in SPSS 19 with the expectation-maximization (EM) algorithm, a model trained on both groups was used to impute the three human raters’ scores and their average for the second group. The imputed scores were then treated as adjusted scores and compared with the scores predicted by the four automated scoring systems. The 10-fold cross-validation method was used to replicate the random-split procedure and to estimate the standard errors of the reliabilities of the predicted scores. The results suggest that, for each of the four automated scoring systems, the adjusted scores are more reliable and show less bias across score points than the unadjusted scores predicted by the system.
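The imputation step described above can be sketched as follows. This is a minimal illustration on simulated data under a multivariate-normal model; the actual study used the missing value analysis module in SPSS 19, so the variable names, sample size, and noise levels here are all hypothetical. The E-step below uses a simplified conditional-mean fill that omits the conditional-covariance correction of full EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical simulated data (the real study used actual rater scores).
# Columns: three human raters' scores, their average, one automated
# system's predicted score, and an IRT-based score from other items.
n = 200
true_ability = rng.normal(0, 1, n)
raters = true_ability[:, None] + rng.normal(0, 0.5, (n, 3))
avg = raters.mean(axis=1, keepdims=True)
auto = true_ability[:, None] + rng.normal(0, 0.6, (n, 1))
irt = true_ability[:, None] + rng.normal(0, 0.7, (n, 1))
X = np.hstack([raters, avg, auto, irt])

# Randomly split examinees into two groups; for the second group,
# treat the human-rater columns (0-3) as missing data.
group2 = rng.permutation(n)[: n // 2]
mask = np.zeros_like(X, dtype=bool)
mask[group2[:, None], np.arange(4)] = True
X_obs = X.copy()
X_obs[mask] = np.nan

def em_impute(X, n_iter=50):
    """EM-style imputation under a multivariate-normal model:
    fill missing entries with their conditional means given the
    observed entries, then re-estimate the mean and covariance."""
    Xf = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        S = np.cov(Xf, rowvar=False)
        for i in range(X.shape[0]):
            m = np.isnan(X[i])
            if not m.any():
                continue
            o = ~m
            # Conditional mean of the missing block given the observed block.
            S_oo = S[np.ix_(o, o)] + 1e-8 * np.eye(o.sum())
            S_mo = S[np.ix_(m, o)]
            Xf[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, Xf[i, o] - mu[o])
    return Xf

X_imp = em_impute(X_obs)

# Compare the imputed ("adjusted") average rater score for group 2
# with the automated system's score, using correlation with the
# simulated true ability as a crude reliability proxy.
r_adj = np.corrcoef(X_imp[group2, 3], true_ability[group2])[0, 1]
r_auto = np.corrcoef(X[group2, 4], true_ability[group2])[0, 1]
print(f"reliability proxy -- adjusted: {r_adj:.3f}, automated: {r_auto:.3f}")
```

In the study this random-split-and-impute procedure was replicated via 10-fold cross-validation to obtain standard errors for the reliability estimates; the sketch above shows a single split only.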