EVALUATING LLM‑GENERATED FEEDBACK FOR DEBUGGING ASSISTANCE IN CS1
1 Indian Institute of Technology Kanpur (INDIA)
2 Indian Institute of Information Technology Nagpur (INDIA)
About this paper:
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
Providing timely and effective feedback on programming assignments is a critical challenge in introductory programming courses (CS1) with large enrollments. While Teaching Assistants (TAs) play a crucial role in helping students identify and understand logical errors in their code, increasing class sizes make it difficult to deliver personalized feedback to all students. Prior automated feedback systems often require manual intervention, rely on predefined error models, or generate generic responses that fail to address specific issues in students' code. Large Language Models (LLMs) present new opportunities for generating natural language feedback at scale, potentially enabling students to understand and fix logical errors without extensive manual intervention.
This paper compares LLM-generated feedback with human TA feedback for debugging assistance in CS1 courses. We generated feedback for buggy code samples using GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o. For systematic evaluation, we designed five prompt variants based on a common template that includes the buggy code, the correct solution, the problem statement, failing manual test cases, and failing LLM-generated test cases. GPT-3.5 Turbo consistently produced misleading feedback due to hallucinations and was excluded from subsequent analysis. For the GPT-4 variants, we developed a systematic selection process to identify the most accurate feedback responses across prompt configurations.
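The common template above can be sketched as follows. This is a hypothetical illustration of how the five components might be assembled into a single prompt; the function name, section headings, and instruction wording are assumptions, not the paper's exact prompt.

```python
# Hypothetical sketch of the common prompt template: five components
# (problem statement, buggy code, correct solution, failing manual
# tests, failing LLM-generated tests) assembled into one prompt.
def build_feedback_prompt(problem_statement, buggy_code, correct_solution,
                          manual_failures, generated_failures):
    """Assemble a debugging-feedback prompt from the five components."""
    sections = [
        ("Problem statement", problem_statement),
        ("Buggy code", buggy_code),
        ("Correct solution", correct_solution),
        ("Failing manual test cases", "\n".join(manual_failures)),
        ("Failing LLM-generated test cases", "\n".join(generated_failures)),
    ]
    body = "\n\n".join(f"### {title}\n{text}" for title, text in sections)
    # Ask for natural-language feedback rather than a corrected program.
    instruction = ("Explain, in plain language, the logical error(s) in the "
                   "buggy code without revealing the corrected code.")
    return f"{body}\n\n{instruction}"
```

Prompt variants would then differ in which sections are included or how the instruction is phrased.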
We conducted a comprehensive three-phase user study involving 24 TAs and 80 undergraduate students from two educational institutions. In phase 1, two experienced TAs wrote human feedback for seven code samples containing subtle logical bugs. In phase 2, the remaining 22 TAs rated seven feedback responses (five LLM-generated and two human-generated) for each problem on a 5-point scale assessing helpfulness in bug identification. In phase 3, students evaluated the highest-rated feedback through a before-and-after design measuring improvement in bug identification accuracy.
Our results show that LLM feedback achieved competitive ratings, ranking in the top two positions for 5 of the 7 problems evaluated. In certain cases, LLM feedback demonstrated superior detail and specificity, explicitly identifying buggy statements and providing comprehensive explanations. However, it occasionally exhibited overdiagnosis, reporting false positives alongside accurate bugs, which could increase cognitive load and confuse novice programmers during debugging.
Student learning outcomes differed significantly between feedback sources. While both LLM and human feedback improved bug identification rates, human feedback produced superior learning outcomes, yielding 2.5 times greater improvement in bug identification accuracy than LLM feedback. Institutional background also significantly influenced both baseline debugging competency and how effectively students used the feedback. These findings suggest that LLM-generated feedback can partially address scalability challenges in programming education, complementing but not replacing human TA guidance; strategic integration of both modalities offers a practical path toward scalable, quality feedback in CS1 courses.
Keywords:
CS1, CS education, automated feedback, LLM, logical errors.