HUMAN–AI COLLABORATION IN SOFTWARE DEVELOPMENT: A COMPARATIVE STUDY OF LLM-GENERATED CODE FOR EDUCATIONAL INNOVATION
Marist University (UNITED STATES)
About this paper:
Appears in: INTED2026 Proceedings
Publication year: 2026
Article: 0858
ISBN: 978-84-09-82385-7
ISSN: 2340-1079
doi: 10.21125/inted.2026.0858
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
As AI-assisted programming tools enter both classrooms and professional practice, educators are confronting a pressing challenge: understanding how large language models (LLMs) actually reason through coding tasks and determining how these systems can be responsibly integrated into teaching and learning. While AI tools offer unprecedented support for debugging, code generation, and creative problem solving, their internal logic remains largely opaque. Students can produce functioning code without understanding why it works, and instructors struggle to evaluate the reliability of AI-generated solutions and model the reasoning processes that underlie them. This transparency gap complicates efforts to build meaningful AI literacy and raises pedagogical concerns about whether students are learning to think critically about code or simply learning to query a model.

This project addresses that gap by empirically examining how three leading LLMs—GPT, Gemini, and Claude—respond to the same set of programming challenges across algorithm design, debugging, documentation, simulation, and application development. All prompts were implemented in Python to ensure consistency and accessibility, and each model’s outputs were systematically collected, standardized, and human-evaluated. By comparing model behavior side by side, this study offers insight into differences in reasoning patterns, levels of correctness, consistency, and the models’ varying degrees of dependence on human guidance.
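A collection step of this kind can be sketched as follows. This is a minimal illustration, not the study's actual harness: the `TrialRecord` fields and the `query_fn` callback are assumptions standing in for whatever provider clients and repository schema the project really used.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical record format for one model response; the field names are
# illustrative and do not come from the study's repository.
@dataclass
class TrialRecord:
    model: str                      # e.g. "gpt", "gemini", "claude"
    task_category: str              # algorithm design, debugging, ...
    prompt: str
    output: str                     # raw code returned by the model
    correct: Optional[bool] = None  # filled in later by a human evaluator

def collect(models: list[str], task_category: str, prompt: str,
            query_fn: Callable[[str, str], str]) -> list[TrialRecord]:
    """Send the same prompt to each model and return standardized records.

    `query_fn(model, prompt)` is a placeholder for the provider-specific
    API call; stubbing it keeps this sketch runnable without credentials.
    """
    return [
        TrialRecord(model=m, task_category=task_category,
                    prompt=prompt, output=query_fn(m, prompt))
        for m in models
    ]

# Stubbed example run (no real API calls are made):
records = collect(
    ["gpt", "gemini", "claude"],
    "algorithm design",
    "Write a Python function that reverses a linked list.",
    query_fn=lambda model, prompt: f"# {model} response placeholder",
)
```

Keeping one uniform record per (model, task) pair is what makes side-by-side comparison and later human scoring straightforward.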

The central focus of the project is the role of prompting and human–AI interaction. The study contrasts single-prompt instructions with incremental prompting, in which goals are clarified, errors are corrected, and partial solutions are refined. The results demonstrate that LLM performance improves substantially when humans guide the reasoning process, suggesting that effective AI-supported learning is not an automated experience but a collaborative one. These findings highlight the importance of teaching students not only to query AI but to question it, verify its outputs, and engage critically with its reasoning.
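The two prompting conditions can be contrasted in a short sketch. Here `ask` stands in for a model call and `passes_tests` for the study's correctness checks; neither name comes from the paper, and the feedback wording is purely illustrative.

```python
# Minimal sketch of the two conditions compared in the study, under the
# assumption that correctness can be checked between rounds.

def single_prompt(ask, task):
    """One-shot condition: the model sees the task once; output is final."""
    return ask(task)

def incremental_prompt(ask, task, passes_tests, max_rounds=3):
    """Iterative condition: failures are fed back so the model can refine."""
    solution = ask(task)
    for _ in range(max_rounds):
        if passes_tests(solution):
            break
        # The follow-up prompt carries the failing attempt back to the
        # model, mirroring the clarify-and-correct dialogue the study
        # describes for human-guided interaction.
        solution = ask(
            f"{task}\nPrevious attempt failed tests:\n{solution}\nPlease fix it."
        )
    return solution
```

The one-shot path gives a single sample of model behavior, while the loop makes the human's role explicit: each round injects new guidance rather than leaving the model to reason alone.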

To support transparency and reproducibility, the project produced a publicly accessible repository containing prompts, model outputs, evaluation notes, and metadata. This resource provides instructors with concrete examples they can use to teach AI reasoning, model comparison, and responsible debugging practices. It also creates a foundation for ongoing research on prompt engineering, error patterns, and the interactional dynamics between humans and LLMs.
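One way such a repository entry might be organized is sketched below. The abstract does not specify the repository's actual layout, so every field name and placeholder value here is an assumption chosen to illustrate pairing prompts, per-model outputs, evaluation notes, and metadata in a single record.

```python
import json

# Hypothetical shape of one repository entry; "..." marks placeholder
# content, and the evaluation notes are invented examples, not the
# study's real findings.
entry = {
    "task_id": "debugging-03",      # illustrative identifier
    "category": "debugging",
    "prompt": "Fix the off-by-one error in this loop ...",
    "responses": {
        "gpt":    {"code": "...", "evaluation_notes": "..."},
        "gemini": {"code": "...", "evaluation_notes": "..."},
        "claude": {"code": "...", "evaluation_notes": "..."},
    },
    "metadata": {"language": "python", "evaluated_by": "human reviewer"},
}

# Entries in this shape serialize directly to JSON for a public repository.
serialized = json.dumps(entry, indent=2)
```

Storing all three models' responses under one task record is what lets instructors pull ready-made, like-for-like comparisons into the classroom.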

By revealing how different models approach identical problems and how those approaches change under human direction, this project offers educators a clearer understanding of when AI tools enhance learning and when they obscure essential reasoning. Ultimately, the study strengthens the foundation for responsible GenAI integration in computing education and equips students with the critical skills needed to navigate an AI-mediated future in software development.
Keywords:
LLM reasoning and code generation, Prompt engineering and education, Human-AI collaboration in programming, AI model comparison, AI skills.