MACHINE GENERATED TEXT DETECTION WITH PRE-TRAINED LANGUAGE MODEL AND LEXICAL DIVERSITY
Instituto Nacional de Astrofísica Óptica y Electrónica (MEXICO)
About this paper:
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
The robust detection of Machine-Generated Text (MGT) has become a critical research area because language produced by Large Language Models (LLMs) is increasingly indistinguishable from human writing. This challenge directly threatens academic integrity and the reliability of student assessments across all educational levels. This work addresses the question of whether the semantic representations of a RoBERTa classifier are sufficient for this crucial task or whether they require enrichment with lexical diversity features, under the hypothesis that MGT exhibits a less varied vocabulary. We compared a Base Model (a fine-tuned RoBERTa-base with a Multilayer Perceptron (MLP) classification head) with an Enriched Model (the Base Model enhanced with lexical diversity metrics). Using the SemEval-2024 Task 8 corpus, the lexical feature Measure of Textual Lexical Diversity (MTLD) initially improved the Base Model's accuracy from 0.80 to 0.85. However, the Optimized Base Model, which underwent hyperparameter tuning on the MLP classification head, achieved a superior accuracy of 0.88. Crucially, the optimization process did not further improve the Enriched Model, suggesting that the lexical metric becomes redundant within a finely calibrated system. We conclude that hyperparameter optimization is the decisive strategy for achieving state-of-the-art MGT detection, enabling educational institutions to implement more effective and robust plagiarism detection strategies against advanced AI generation.
Keywords:
Artificial Intelligence, Pre-trained Language Model, Lexical Diversity, Machine-Generated Text Detection.
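To make the architecture described in the abstract concrete, the following is a minimal sketch of the Enriched Model idea, assuming a standard PyTorch / Hugging Face Transformers setup: a RoBERTa-base encoder whose pooled representation is concatenated with an MTLD score before an MLP classification head. The MLP width, dropout rate, and the whitespace tokenisation used to compute MTLD are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity (McCarthy & Jarvis):
    mean of a forward and a backward factor-counting pass."""
    def one_pass(seq):
        factors, types, count = 0.0, set(), 0
        for tok in seq:
            count += 1
            types.add(tok)
            if len(types) / count <= threshold:   # segment TTR fell to the cut-off
                factors += 1.0
                types, count = set(), 0
        if count > 0:                             # partial factor for the remainder
            ttr = len(types) / count
            factors += (1.0 - ttr) / (1.0 - threshold)
        return len(seq) / factors if factors else float(len(seq))
    return 0.5 * (one_pass(tokens) + one_pass(list(reversed(tokens))))


class EnrichedRobertaClassifier(nn.Module):
    """RoBERTa-base encoder whose pooled <s> representation is concatenated
    with a scalar MTLD feature before an MLP classification head."""
    def __init__(self, hidden_units=256, n_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        dim = self.encoder.config.hidden_size     # 768 for roberta-base
        self.head = nn.Sequential(
            nn.Linear(dim + 1, hidden_units),     # +1 slot for the MTLD scalar
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_units, n_labels),    # human vs. machine-generated
        )

    def forward(self, input_ids, attention_mask, mtld_feature):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]         # representation of the <s> token
        x = torch.cat([cls, mtld_feature.unsqueeze(-1)], dim=-1)
        return self.head(x)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    text = "A short passage whose origin (human or LLM) we want to classify."
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    feat = torch.tensor([mtld(text.lower().split())], dtype=torch.float)
    model = EnrichedRobertaClassifier()
    logits = model(enc["input_ids"], enc["attention_mask"], feat)  # shape (1, 2)
```

In practice the MTLD score would typically be standardised over the training split before concatenation, since its scale differs from that of the encoder activations; removing the extra feature column recovers the Base Model configuration.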