MACHINE GENERATED TEXT DETECTION WITH PRE-TRAINED LANGUAGE MODEL AND LEXICAL DIVERSITY
Instituto Nacional de Astrofísica Óptica y Electrónica (MEXICO)
About this paper:
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
The robust detection of Machine-Generated Text (MGT) has become a critical research area because language produced by Large Language Models (LLMs) is increasingly indistinguishable from human writing. This challenge directly threatens academic integrity and the reliability of student assessments across all educational levels. This work addresses the question of whether the semantic representations of a RoBERTa classifier are sufficient for this crucial task or whether they require enrichment with lexical diversity features, under the hypothesis that MGT exhibits a less varied vocabulary. We compared a Base Model (a fine-tuned RoBERTa-base with a Multilayer Perceptron (MLP) classification head) with an Enriched Model (the Base Model enhanced with lexical diversity metrics). Using the SemEval-2024 Task 8 corpus, the lexical feature Measure of Textual Lexical Diversity (MTLD) initially improved the Base Model's accuracy from 0.80 to 0.85. However, the Optimized Base Model, which underwent hyperparameter tuning on the MLP classification head, achieved a superior accuracy of 0.88. Crucially, the optimization process did not further improve the Enriched Model, suggesting that the lexical metric becomes redundant within a finely calibrated system. We conclude that hyperparameter optimization is the decisive strategy for achieving state-of-the-art MGT detection, enabling educational institutions to implement more effective and robust plagiarism detection strategies against advanced AI generation.
Keywords:
Artificial Intelligence, Pre-trained Language Model, Lexical Diversity, Machine-Generated Text Detection.
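To make the architecture described in the abstract concrete, the following is a minimal sketch of the Enriched Model idea, assuming a standard PyTorch / Hugging Face Transformers setup: a RoBERTa-base encoder whose pooled representation is concatenated with an MTLD score before an MLP classification head. The MLP width, dropout rate, and the whitespace tokenisation used to compute MTLD are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity (McCarthy & Jarvis):
    mean of a forward and a backward factor-counting pass."""
    def one_pass(seq):
        factors, types, count = 0.0, set(), 0
        for tok in seq:
            count += 1
            types.add(tok)
            if len(types) / count <= threshold:   # segment TTR fell to the cut-off
                factors += 1.0
                types, count = set(), 0
        if count > 0:                             # partial factor for the remainder
            ttr = len(types) / count
            factors += (1.0 - ttr) / (1.0 - threshold)
        return len(seq) / factors if factors else float(len(seq))
    return 0.5 * (one_pass(tokens) + one_pass(list(reversed(tokens))))


class EnrichedRobertaClassifier(nn.Module):
    """RoBERTa-base encoder whose pooled <s> representation is concatenated
    with a scalar MTLD feature before an MLP classification head."""
    def __init__(self, hidden_units=256, n_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        dim = self.encoder.config.hidden_size     # 768 for roberta-base
        self.head = nn.Sequential(
            nn.Linear(dim + 1, hidden_units),     # +1 slot for the MTLD scalar
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_units, n_labels),    # human vs. machine-generated
        )

    def forward(self, input_ids, attention_mask, mtld_feature):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]         # representation of the <s> token
        x = torch.cat([cls, mtld_feature.unsqueeze(-1)], dim=-1)
        return self.head(x)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    text = "A short passage whose origin (human or LLM) we want to classify."
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    feat = torch.tensor([mtld(text.lower().split())], dtype=torch.float)
    model = EnrichedRobertaClassifier()
    logits = model(enc["input_ids"], enc["attention_mask"], feat)  # shape (1, 2)
```

In practice the MTLD score would typically be standardised over the training split before concatenation, since its scale differs from that of the encoder activations; removing the extra feature column recovers the Base Model configuration.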