DEFINING NORMALIZED PROMPTS FOR AI CHATBOT TESTING IN PROGRAMMING TASKS
University North (CROATIA)
About this paper:
Appears in: INTED2026 Proceedings
Publication year: 2026
Article: 0188
ISBN: 978-84-09-82385-7
ISSN: 2340-1079
doi: 10.21125/inted.2026.0188
Conference name: 20th International Technology, Education and Development Conference
Dates: 2-4 March, 2026
Location: Valencia, Spain
Abstract:
In our ongoing research on AI chatbot performance in solving programming tasks, we have observed that most models are capable of generating syntactically correct and executable code, yet their outputs often differ considerably in structure, length, and optimization. These inconsistencies make direct comparison of results difficult and limit the objectivity of evaluation.

When tested on representative C/C++ and JavaScript problems, models such as ChatGPT, Claude, Copilot, DeepSeek, and Gemini produced functional solutions, but with noticeable differences in variable naming, code verbosity, and algorithmic efficiency. To reduce such discrepancies, we propose an approach based on normalized prompt templates. Each programming task is accompanied by constraints instructing the chatbot to generate code without comments, to prefer mathematically optimized over brute-force methods, and to minimize code length. We acknowledge that such normalization may not always follow best “clean code” practices; however, for the purpose of objective comparison, these aspects are intentionally disregarded. In practical or educational contexts, such information can later be retrieved or refined through additional interaction with the chatbot, serving teaching, documentation, or tutoring purposes.
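A normalized prompt template of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact template: the constraint wording and the `build_prompt` helper are hypothetical, chosen only to show how a fixed constraint block yields identical prompts across chatbots.

```python
# Hypothetical normalized constraint block; wording is illustrative,
# based on the three constraints described in the abstract.
NORMALIZED_CONSTRAINTS = [
    "Generate code without comments.",
    "Prefer mathematically optimized solutions over brute-force methods.",
    "Minimize code length.",
]

def build_prompt(task_description: str, language: str) -> str:
    """Combine a task with the fixed constraint block so that every
    chatbot receives an identical, reproducible prompt."""
    constraints = "\n".join(f"- {c}" for c in NORMALIZED_CONSTRAINTS)
    return (
        f"Solve the following task in {language}.\n"
        f"Constraints:\n{constraints}\n\n"
        f"Task: {task_description}"
    )

prompt = build_prompt("Compute the sum of integers from 1 to n.", "C")
print(prompt)
```

Because the constraint block is constant, any differences between model outputs can be attributed to the models rather than to prompt variation.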

This approach aims to establish a consistent framework for evaluating chatbot-generated code not only in terms of correctness and functional execution, but also regarding length, algorithmic complexity, and implementation style. Testing indicates that such a standardized prompt structure can reduce output variability and reveal characteristic tendencies in model behavior. While further systematic evaluation is required, this approach could provide a foundation for developing more objective and comparable methods for assessing the programming capabilities of AI chatbots.
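The "mathematically optimized over brute-force" constraint above can be illustrated with a classic example. The task and both implementations are hypothetical, chosen only to show the kind of structural difference (loop vs. closed form, O(n) vs. O(1)) that the evaluation framework is meant to surface.

```python
def sum_brute_force(n: int) -> int:
    # O(n) loop: the verbose style chatbots often produce by default
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n: int) -> int:
    # O(1) closed form (n(n+1)/2): the style the normalized prompt requests
    return n * (n + 1) // 2

# Both are functionally correct, but differ in length and complexity
assert sum_brute_force(1000) == sum_closed_form(1000) == 500500
```

Under a normalized prompt, a model's preference for one form over the other becomes a measurable, comparable property rather than an artifact of prompt phrasing.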
Keywords:
AI, chatbot, programming, normalization, ChatGPT, Claude, Copilot, DeepSeek, Gemini, C/C++, JavaScript, prompt engineering.