LEVERAGING LARGE LANGUAGE MODELS FOR QA GENERATION AND EVALUATION
1 State University of New York (SUNY) Old Westbury (UNITED STATES)
2 Loyola University Chicago (UNITED STATES)
About this paper:
Appears in: ICERI2024 Proceedings
Publication year: 2024
Page: 10143 (abstract only)
ISBN: 978-84-09-63010-3
ISSN: 2340-1095
doi: 10.21125/iceri.2024.2562
Conference name: 17th annual International Conference of Education, Research and Innovation
Dates: 11-13 November, 2024
Location: Seville, Spain
Abstract:
Question-answer generation (QAG), or question answering (QA), focuses on creating answers to questions based on a specified context. From the days of structured databases and constrained query formats that could retrieve answers from only a narrow scope of information, QA systems backed by machine learning and artificial intelligence have become powerful tools for producing meaningful, context-aware output. In particular, transformer-based pre-trained large language models (LLMs) are at the forefront of the discussion on natural language processing (NLP) tasks. One of the most prominent present-day uses of automatic answer retrieval is in education. This study focuses on leveraging such language models to build a question-answering system based on healthcare information, capable of receiving and evaluating user responses in real time. This feeds into the larger project goal of establishing a bilingual intelligent tutoring system prototype geared towards educating low-literacy Hispanic breast cancer survivors on health and survivorship topics. Models such as Hugging Face's Instructor-xl and all-MiniLM-L6-v2 (MiniLM) and OpenAI's GPT-3.5-turbo and GPT-4o, along with metrics such as METEOR and BERTScore, were used in experiments to generate and evaluate answers to previously generated health-related questions. Code-switching functionality was also incorporated into testing to determine how well the models perform when handling both English and Spanish. For English answer generation, GPT achieved a 96.38% manual score when evaluated for grammar, correctness, and meaningfulness based solely on health information transcriptions and questions. Spanish answer generation scored slightly lower at 86.11%, with more grammar and meaning errors. Across the different experiments in this study, the usefulness and applicability of generative AI for virtual tutoring were repeatedly demonstrated, providing grounds for further discussion of how this constantly evolving technology could benefit diverse learning environments.
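The abstract does not include the authors' pipeline, so the following is only a minimal sketch of the kind of workflow described: generating an answer from a transcription and question with a GPT chat model, then scoring it against a reference answer with METEOR and BERTScore. It assumes the openai, nltk, and bert-score Python packages; the prompt, model choice, and sample data are illustrative, not the study's actual implementation.

# Sketch: generate an answer with a GPT model and score it with
# METEOR and BERTScore. Prompt and data are illustrative only.
from openai import OpenAI
import nltk
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(context: str, question: str, model: str = "gpt-4o") -> str:
    """Ask the model to answer using only the given health transcription."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer using only the provided health transcription."},
            {"role": "user",
             "content": f"Transcription:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Hypothetical context, question, and reference (not from the study's data).
context = ("Survivors should attend regular follow-up visits so their care "
           "team can monitor recovery and catch any recurrence early.")
question = "Why are follow-up visits important for survivors?"
reference = ("Follow-up visits let the care team monitor recovery and "
             "detect any recurrence early.")

candidate = generate_answer(context, question)

# METEOR compares tokenized candidate and reference answers.
meteor = meteor_score([reference.split()], candidate.split())

# BERTScore compares contextual embeddings; F1 is the usual headline figure.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"Candidate: {candidate}")
print(f"METEOR: {meteor:.3f}  BERTScore F1: {f1.item():.3f}")

For the Spanish code-switching condition, the same scaffolding would presumably run with Spanish inputs and lang="es" in the BERTScore call; manual scoring for grammar, correctness, and meaningfulness, as reported above, sits outside anything this sketch automates.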
Keywords:
Question-answer generation, large language models (LLMs), evaluation.