EXPLORING MULTI-LLM GENERATION AND AI-BASED QUALITY CONTROL FOR HIGH-LEVEL BPMN EXAM QUESTIONS
University of Applied Sciences and Arts Northwestern Switzerland FHNW (SWITZERLAND)
About this paper:
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March 2026
Location: Valencia, Spain
Abstract:
Building on earlier work on automated question generation and AI-based quality assurance, this study investigates the creation of high-quality multiple-choice questions for a semester final exam on Business Process Model and Notation (BPMN). In contrast to previous research that focused on IPMA certification content, the present study targets the two highest cognitive levels of Bloom’s taxonomy—analysis and evaluation—where question quality is both critical and difficult to automate.
Three leading Large Language Models (LLMs) are systematically compared as generators of BPMN-related exam questions. Each model is tasked with producing structurally valid Multiple Choice (MC) items that require interpretation of BPMN models, identification of modelling inconsistencies, evaluation of modelling decisions, and selection of the most defensible answer among plausible distractors. A secondary, independent AI model then evaluates and, when needed, corrects each question. Its role is to detect misalignment with the targeted Bloom levels, logical flaws, ambiguous distractors, modelling inaccuracies, and structural issues affecting exam readiness.
Following a full-factorial experimental design, all combinations of generators and reviewers are tested. This allows us to analyse both the standalone performance of each generator and the incremental improvement achieved through different reviewer systems; a minimal code sketch of this setup follows the metrics list below. Evaluation metrics include:
- The proportion of questions that genuinely match Bloom levels 5–6 after AI review.
- Improvements introduced by the reviewer (semantic, structural, taxonomic).
- Remaining manual interventions required by subject-matter experts.
- Differences in consistency and cognitive depth across generator–reviewer pairs.
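To make the full-factorial design concrete, the sketch below pairs every generator with every reviewer and runs each draft question through the reviewing model. It is a minimal illustration under stated assumptions, not the study's actual implementation: the model identifiers, prompt texts, and the call_llm helper are hypothetical placeholders, since the abstract names neither the models nor the prompt wording.

from itertools import product

# Hypothetical identifiers; the abstract does not name the three generator
# LLMs or the reviewer models used in the study.
GENERATORS = ["generator_a", "generator_b", "generator_c"]
REVIEWERS = ["reviewer_x", "reviewer_y", "reviewer_z"]

# Illustrative prompts paraphrasing the tasks described in the abstract.
GENERATE_PROMPT = (
    "Write one multiple-choice question on BPMN that requires interpreting "
    "a BPMN model, identifying modelling inconsistencies, and evaluating "
    "modelling decisions (Bloom levels 5-6). Provide four options with "
    "plausible distractors and mark the single most defensible answer."
)
REVIEW_PROMPT = (
    "Review the following multiple-choice question. Check Bloom-level "
    "alignment, logical soundness, distractor ambiguity, BPMN notation "
    "accuracy, and exam readiness. Return a corrected version if needed.\n\n"
)

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to the given model."""
    raise NotImplementedError  # wire up the API client of your choice

def run_factorial_study(n_items: int = 10) -> dict[tuple[str, str], list[str]]:
    """Test every generator-reviewer pair and collect the reviewed items."""
    results: dict[tuple[str, str], list[str]] = {}
    for generator, reviewer in product(GENERATORS, REVIEWERS):
        items = []
        for _ in range(n_items):
            draft = call_llm(generator, GENERATE_PROMPT)       # generation step
            reviewed = call_llm(reviewer, REVIEW_PROMPT + draft)  # QC step
            items.append(reviewed)
        results[(generator, reviewer)] = items
    return results

The per-pair question sets collected this way can then be scored against the metrics listed above, separating each generator's standalone quality from the improvement contributed by each reviewer.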
Preliminary findings reveal substantial variation in the cognitive rigour and modelling precision of the generated questions. Some combinations consistently achieve high Bloom levels with minimal expert revision, while others show recurrent weaknesses in distractor design, BPMN notation accuracy, or depth of required reasoning. The results highlight the potential and limitations of a multi-LLM pipeline for producing exam-grade, higher-order MC questions in the BPMN domain.
Future work will examine the use of these AI-generated question banks in authentic teaching settings, analysing student performance, perceived difficulty, and the pedagogical value of higher-order MC items in process modelling education.
Keywords:
Generative AI, LLMs, BPMN, assessment, Bloom taxonomy, question generation, AI-based quality control.