CLUSTERING AND VISUALIZATION OF AUTHORS’ FEATURE USING WORD FORMATTING INFORMATION TO SUPPORT PLAGIARISM DETECTION IN CLASS ASSIGNMENT REPORTS
Osaka Sangyo University (JAPAN)
About this paper:
Appears in: EDULEARN23 Proceedings
Publication year: 2023
Pages: 3897-3901
ISBN: 978-84-09-52151-7
ISSN: 2340-1117
doi: 10.21125/edulearn.2023.1052
Conference name: 15th International Conference on Education and New Learning Technologies
Dates: 3-5 July, 2023
Location: Palma, Spain
Abstract:
COVID-19 led to the rapid adoption of online classes at universities. Because electronic submission is convenient, students are often still required to submit reports electronically even after face-to-face classes have resumed. The most popular application for creating electronic reports is Microsoft Word, which is widely used not only for report writing in educational institutions but also for writing academic papers. For example, many international conferences require paper manuscripts to be submitted as Word files.

Since Office 2007, Microsoft Word has used the .docx file format, which is specified by Office Open XML. A .docx file is a single ZIP-compressed package that contains multiple XML files and image files.
By analyzing each XML file that makes up the .docx file, it is possible to obtain quantitative values for the document structure and appearance of a Word document.
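As an illustration of this kind of analysis, the following is a minimal sketch in Python that opens a .docx package and counts a few appearance-related properties in word/document.xml. It is not the extraction procedure used in the study. The function name extract_features and the chosen features are assumptions made for this example, and the sketch reads only direct formatting, ignoring formatting inherited from styles.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_features(docx_file):
    """Count simple appearance features in a .docx file.

    docx_file may be a path or a binary file-like object.
    Hypothetical helper: only direct run/paragraph formatting is counted.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    runs = bold_runs = underlined_runs = 0
    for run in root.iter(W + "r"):
        runs += 1
        props = run.find(W + "rPr")
        if props is None:
            continue
        bold = props.find(W + "b")
        # <w:b/> means bold unless w:val explicitly switches it off
        if bold is not None and bold.get(W + "val", "true") not in ("false", "0"):
            bold_runs += 1
        underline = props.find(W + "u")
        # <w:u w:val="none"/> explicitly removes an underline
        if underline is not None and underline.get(W + "val", "single") != "none":
            underlined_runs += 1

    # w:spacing elements that carry w:line describe paragraph line spacing
    line_spacings = sorted(
        {int(s.get(W + "line")) for s in root.iter(W + "spacing") if s.get(W + "line")}
    )

    return {
        "paragraphs": sum(1 for _ in root.iter(W + "p")),
        "runs": runs,
        "bold_runs": bold_runs,
        "underlined_runs": underlined_runs,
        "line_spacings": line_spacings,
    }
```

Calling extract_features("report.docx") on a submitted file (the file name is a placeholder) returns a dictionary whose numeric entries could be arranged into a feature vector.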

The purpose of this study is to support teachers in detecting plagiarism and to reduce their burden. Through XML analysis, we acquire information on document appearance, such as boldface, underlines, and line spacing, which teachers often check visually, together with a large amount of information on document structure, such as headings and charts, and use this information as a feature vector.

In this paper, we report on an attempt to extract features based on document structure and appearance from actual report documents submitted in class, and to apply k-means clustering so that information useful for supporting teachers' visual checks can be presented visually.
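To make the clustering step concrete, the following is a minimal sketch of k-means (Lloyd's algorithm) in pure Python applied to small feature vectors. The abstract does not state the number of clusters, the initialization, or the feature scaling used in the study, so the function name kmeans and its parameters k, iterations, and seed are assumptions chosen only for illustration.

```python
import random


def _squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster vectors into k groups; returns (labels, centroids).

    Hypothetical helper: initial centroids are k randomly sampled input vectors.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [-1] * len(vectors)

    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: _squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return labels, centroids
```

Reports that receive the same label have similar formatting features, which is the kind of grouping a teacher could then inspect visually, for example in a heatmap of the feature values.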
Keywords:
Office Open XML, Word formatting information, docx, Plagiarism detection, Scoring, Academic reports, k-means clustering, heatmap.