🤖 AI Summary
This study systematically evaluates the reliability of large language models (LLMs) in autonomously generating diagnostic feedback for reasoning errors in programming learning. We construct a fine-grained benchmark of 45 authentic student submissions, manually annotated and quantitatively evaluated across three dimensions (accuracy, completeness, and actionability) for GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro. Results show that 63% of generated feedback hints are both accurate and complete, with primary failure modes including misaligned line-number references and factual hallucinations; although GPT-4o achieves the highest performance, it still exhibits significant explanatory deficiencies. This work is the first to empirically delineate the risk boundaries of LLMs in pedagogical feedback for reasoning errors. Its core contribution is the first evaluation framework specifically designed to assess LLM-generated educational feedback on reasoning errors, quantifying both the current capability ceiling and prevalent failure patterns and showing that LLMs are not yet able to replace human instructors in this critical educational task.
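The headline metric, the share of hints that are both accurate and complete, can be sketched as a simple aggregation over per-hint annotations. The annotation schema, field names, and sample records below are hypothetical illustrations, not the study's actual rubric or data.

```python
from dataclasses import dataclass


@dataclass
class HintAnnotation:
    """Manual annotation of one LLM-generated feedback hint (hypothetical schema)."""
    accurate: bool    # does the hint identify the actual error?
    complete: bool    # does it cover the full reasoning mistake?
    actionable: bool  # could a student act on it to fix the code?


def accurate_and_complete_rate(annotations: list[HintAnnotation]) -> float:
    """Fraction of hints judged both accurate and complete."""
    hits = sum(1 for a in annotations if a.accurate and a.complete)
    return hits / len(annotations)


# Illustrative records mirroring the reported failure modes.
sample = [
    HintAnnotation(accurate=True, complete=True, actionable=True),
    HintAnnotation(accurate=True, complete=True, actionable=False),
    HintAnnotation(accurate=True, complete=False, actionable=True),    # flawed explanation
    HintAnnotation(accurate=False, complete=False, actionable=False),  # hallucinated issue
]
print(accurate_and_complete_rate(sample))  # 0.5
```

In the study, the same kind of aggregation over 45 annotated submissions yields the reported 63% rate; the actionability dimension would be aggregated analogously.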
📝 Abstract
Providing effective feedback is important for student learning in programming problem-solving. Large Language Models (LLMs) have emerged as potential tools to automate feedback generation, but their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed each model's capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight both the potential and the limitations of LLMs in programming education and underscore the need for improvements that enhance reliability and minimize risks in educational applications.