ASSESSMENT OF LARGE LANGUAGE MODELS FOR THE GENERATION OF EVALUATION RESOURCES
1 Universidad Rey Juan Carlos (SPAIN)
2 Universidad Nacional de Educación a Distancia (SPAIN)
About this paper:
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
The use of large language models (LLMs) for the automatic generation of multiple-choice questions is emerging as a prominent area of educational innovation, with potential to support both teaching and student self-assessment. However, empirical evidence on the validity and reliability of LLM-generated questions remains limited. This study systematically analyzes the quality of questions produced by an LLM across six Computer Science courses to assess their suitability for both formal assessment and self-learning activities.
The question generation process relies on a metaprompting approach, whereby the LLM autonomously determines the prompt it will use to produce the questions, guided by teacher-provided instructions that outline the pedagogical criteria the generated questions must satisfy. Using the teaching materials of each course, the LLM generates 60 questions balanced across three difficulty levels (basic, intermediate, and advanced) and aligned with the cognitive skills defined in Bloom’s taxonomy. Each question includes a clear stem, several options with a single correct choice, and metadata indicating its syllabus location, topic, difficulty, and Bloom category. The model must rely exclusively on the content provided, avoiding external knowledge and verbatim repetition of examples; additional rules enforce conceptual diversity and prevent redundancy. A sketch of this question format appears below.
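To make the question format concrete, the following is a minimal Python sketch of the record each generated item would populate and of the even split across difficulty levels. All identifiers (QuizQuestion, Difficulty, BloomLevel, balanced_quota) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the question record and the balanced 60-item quota.
# All identifiers here are illustrative assumptions, not the authors' schema.
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    BASIC = "basic"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"


class BloomLevel(Enum):
    REMEMBER = "remember"
    UNDERSTAND = "understand"
    APPLY = "apply"
    ANALYZE = "analyze"
    EVALUATE = "evaluate"
    CREATE = "create"


@dataclass
class QuizQuestion:
    stem: str                  # clear question text
    options: list[str]         # several options...
    correct_index: int         # ...with a single correct choice
    syllabus_location: str     # metadata: where the item sits in the course
    topic: str
    difficulty: Difficulty
    bloom_level: BloomLevel


def balanced_quota(total: int = 60) -> dict[Difficulty, int]:
    """Split the requested bank evenly across the three difficulty levels."""
    per_level, remainder = divmod(total, len(Difficulty))
    quota = {level: per_level for level in Difficulty}
    for level in list(Difficulty)[:remainder]:  # 60 divides evenly, so no remainder here
        quota[level] += 1
    return quota
```

Under this split, each difficulty level receives 20 of the 60 requested items, and the Bloom category travels with each question as metadata for the later analysis.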
An ad hoc rubric was developed to evaluate the questions, focusing on key dimensions of pedagogical quality and accuracy (conceptual correctness, alignment with course content, and coherence between cognitive level and difficulty). It also considers undesirable features such as triviality, redundancy, or ambiguity, and identifies critical errors (e.g., questions outside the syllabus or with incorrect answers) that compromise validity. This expert evaluation serves as the basis for both quantitative and qualitative analyses assessing the reliability and educational usefulness of the automatically generated questions.
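As a rough illustration of how such a rubric might be operationalized for the quantitative analysis, the sketch below records the quality dimensions, undesirable features, and critical errors, and treats any critical error as invalidating the item outright. The field names (RubricScore, incorrect_answer_key, etc.) and the validity rule are assumptions for exposition, not the authors' exact instrument.

```python
# Hypothetical encoding of the evaluation rubric described above. Field names
# and the validity rule are assumptions, not the authors' exact instrument.
from dataclasses import dataclass


@dataclass
class RubricScore:
    # Core pedagogical-quality dimensions
    conceptually_correct: bool
    aligned_with_content: bool
    level_difficulty_coherent: bool
    # Undesirable features
    trivial: bool = False
    redundant: bool = False
    ambiguous: bool = False
    # Critical errors that compromise validity outright
    outside_syllabus: bool = False
    incorrect_answer_key: bool = False

    def is_valid(self) -> bool:
        """Any critical error invalidates the item regardless of other scores."""
        if self.outside_syllabus or self.incorrect_answer_key:
            return False
        return (self.conceptually_correct
                and self.aligned_with_content
                and self.level_difficulty_coherent)
```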
Results indicate that the LLM performs well at assigning question difficulty and Bloom’s taxonomy levels, suggesting a capacity to interpret the conceptual structure of the instructional material. However, several limitations compromise the validity of a substantial portion of the generated questions. Frequent issues include incorrect answers, questions outside the syllabus, and overly trivial formulations. A clear relationship was observed between cognitive complexity and invalidity: higher Bloom levels were more prone to errors. Finally, the LLM tends to produce repetitive or highly similar questions when generating large volumes, reducing the overall diversity of the question bank.
Overall, the findings indicate acceptable performance in certain aspects but also reveal limitations that affect question validity. Teachers possess the expertise to discard invalid items, allowing them to use LLMs effectively as support tools for building question banks. Students, however, tend to rely on LLM outputs uncritically; thus, their use for self-assessment may entail significant risks. There is a clear need to establish guidelines for the safe and pedagogically sound use of LLMs and to incorporate verification mechanisms to ensure their effective integration into formative assessment practices in higher education.
Keywords:
Large Language Models (LLMs), Computer Science Education, Automated Question Generation, Self-Assessment, Formative Evaluation.