Radboud University (NETHERLANDS)
About this paper:
Appears in: EDULEARN21 Proceedings
Publication year: 2021
Pages: 8937-8945
ISBN: 978-84-09-31267-2
ISSN: 2340-1117
doi: 10.21125/edulearn.2021.1798
Conference name: 13th International Conference on Education and New Learning Technologies
Dates: 5-6 July, 2021
Location: Online Conference
Recent findings about the development of writing skills by Dutch children in primary school show an alarming decrease in spelling proficiency in 2019 compared with 2009. These results raise concerns about the quality of spelling education and call for innovative solutions. Insights into which spelling errors are most common and problematic would help design such solutions, but so far this type of large-scale quantitative research has not been conducted.

A recently created corpus of handwritten texts by elementary school children, BasiScript (Tellings et al., 2018), makes this kind of innovative, quantitative research possible. In the present study we use this corpus to address the following research question: which spelling principles are most frequently violated in texts written by sixth graders? Spelling principles are rules that children need to master to write Dutch texts flawlessly.

To answer this question, we developed an automatic spelling error detection and annotation algorithm that we applied to the BasiScript corpus. BasiScript contains digitized (i.e., typed) versions of the original handwritten texts, which include spelling errors (“hypothesis texts”), and corrected versions (“reference texts”). Together with word properties of the corrected texts, such as lemmas, part-of-speech tags and morphemes, these two versions constitute the input to the algorithm.

The spelling error detection and annotation algorithm first aligns the reference and hypothesis texts. This alignment is then split into words, and spelling errors are detected within these words. We define a spelling error as a Phoneme-Corresponding Unit (PCU) that is substituted, deleted or inserted. A PCU is a sequence of graphemes that corresponds to one phoneme (Laarmann-Quante, 2016). For example, huis (house) contains three PCUs: h, ui and s.
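
The PCU-level comparison described above can be illustrated with a minimal Python sketch. It is not the algorithm used in the study: the grapheme inventory is a small, incomplete list assumed for illustration, the greedy segmentation ignores pronunciation, and the standard-library difflib.SequenceMatcher stands in for whatever alignment method was actually used. The function names segment_pcus and detect_errors are hypothetical, and the sketch works on a single word pair rather than on whole aligned texts.

```python
from difflib import SequenceMatcher

# Assumed, incomplete inventory of Dutch two-letter graphemes that
# usually correspond to a single phoneme (illustration only).
MULTI_LETTER_GRAPHEMES = (
    "aa", "ee", "oo", "uu", "ie", "oe", "eu",
    "ui", "ij", "ei", "ou", "au", "ch", "ng",
)


def segment_pcus(word):
    """Greedy split of a word into approximate PCUs (hypothetical helper)."""
    units, i = [], 0
    while i < len(word):
        for grapheme in MULTI_LETTER_GRAPHEMES:
            if word.startswith(grapheme, i):
                units.append(grapheme)
                i += len(grapheme)
                break
        else:
            # No multi-letter grapheme starts here: treat the letter as one unit.
            units.append(word[i])
            i += 1
    return units


def detect_errors(reference, hypothesis):
    """Return (error type, reference PCUs, hypothesis PCUs) for each mismatch."""
    ref, hyp = segment_pcus(reference), segment_pcus(hypothesis)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    names = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((names[tag], ref[i1:i2], hyp[j1:j2]))
    return errors


print(segment_pcus("huis"))           # ['h', 'ui', 's']
print(detect_errors("huis", "hujs"))  # [('substitution', ['ui'], ['u', 'j'])]
```

A real implementation would derive PCUs from a grapheme-to-phoneme mapping rather than from a fixed list, since the same letter sequence can correspond to different phonemes in different words.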

These spelling errors are then annotated with the spelling principle that is violated. To this end, we use a mutually exclusive annotation scheme that is largely based on the one by Horbach-Kleijnen (1997) and adapted to the orthographic properties of Dutch. Using this scheme, we obtain two spelling error annotation layers: one for case-sensitive errors and one for case-insensitive errors. In addition, we add a third annotation layer that describes, for some PCUs in the reference, which spelling principle must be applied to write them correctly. In this way, we gain insight into the percentage of times that a certain spelling principle is applied correctly or incorrectly.
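
As a rough illustration of how such layers could yield a percentage per spelling principle, the sketch below counts how often a principle is applied correctly. Everything in it is an assumption made for the example: the principle labels, the toy data, the function names classify_pair and application_rates, and the reading of the two error layers as "differs only in letter case" versus "differs in other ways". It does not reproduce the annotation scheme of the study.

```python
from collections import Counter


def classify_pair(ref_pcu, hyp_pcu):
    """Classify one aligned PCU pair (assumed interpretation of the layers)."""
    if ref_pcu == hyp_pcu:
        return "correct"
    if ref_pcu.lower() == hyp_pcu.lower():
        return "case_error"    # differs only in upper/lower case
    return "other_error"       # differs even when case is ignored


def application_rates(annotated):
    """Percentage of correct applications per principle.

    `annotated` is an iterable of (principle, reference PCU, hypothesis PCU)
    triples, i.e. reference PCUs that carry a principle label.
    """
    total, correct = Counter(), Counter()
    for principle, ref_pcu, hyp_pcu in annotated:
        total[principle] += 1
        if classify_pair(ref_pcu, hyp_pcu) == "correct":
            correct[principle] += 1
    return {p: 100.0 * correct[p] / total[p] for p in total}


# Toy data with hypothetical labels, not taken from BasiScript.
sample = [
    ("verb_suffix", "dt", "d"),
    ("verb_suffix", "dt", "dt"),
    ("capitalization", "A", "a"),
    ("capitalization", "N", "N"),
]
print(application_rates(sample))
# {'verb_suffix': 50.0, 'capitalization': 50.0}
```

Counting occurrences of a principle in the reference, and not only the errors, is what turns raw error counts into rates that can be compared across principles that differ in frequency.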

Preliminary results reveal that older children produce longer texts, that most errors in sixth grade fall in the syntax category and that, for most error categories, the number of errors decreases over time. Error categories for which this trend is not visible are hyphens, accented letters, compounds and verb suffixes. The present approach allows quantitative analyses of spelling proficiency that have so far not been possible. In the future, these algorithms could be employed in computer applications that provide detailed feedback that not only indicates which letters are incorrect, but also why they are incorrect.
Keywords: BasiScript corpus, Spelling errors, Dutch language, Phoneme-Corresponding Units, Alignment, Spelling principles, Elementary school.