🤖 AI Summary
This study systematically evaluates the reliability of large language models (LLMs) in autonomously generating diagnostic feedback for reasoning errors in programming learning. We construct a fine-grained benchmark of 45 authentic student submissions, manually annotated and quantitatively evaluated across three dimensions (accuracy, completeness, and actionability) for GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro. Results show that 63% of generated feedback hints are both accurate and complete, with primary failure modes including misaligned line-number references and factual hallucinations; although GPT-4o achieves the highest performance, it still exhibits significant explanatory deficiencies. This work is the first to empirically delineate the risk boundaries of LLMs in pedagogical feedback for reasoning errors. Its core contribution is the first evaluation framework specifically designed to assess LLM-generated educational feedback on reasoning errors, quantifying both the current capability ceiling and prevalent failure patterns and showing that LLMs are not yet able to replace human instructors in this critical educational task.
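The headline metric, the share of hints that are both accurate and complete, can be sketched as a simple aggregation over per-hint annotations. The annotation schema, field names, and sample records below are hypothetical illustrations, not the study's actual rubric or data.

```python
from dataclasses import dataclass


@dataclass
class HintAnnotation:
    """Manual annotation of one LLM-generated feedback hint (hypothetical schema)."""
    accurate: bool    # does the hint identify the actual error?
    complete: bool    # does it cover the full reasoning mistake?
    actionable: bool  # could a student act on it to fix the code?


def accurate_and_complete_rate(annotations: list[HintAnnotation]) -> float:
    """Fraction of hints judged both accurate and complete."""
    hits = sum(1 for a in annotations if a.accurate and a.complete)
    return hits / len(annotations)


# Illustrative records mirroring the reported failure modes.
sample = [
    HintAnnotation(accurate=True, complete=True, actionable=True),
    HintAnnotation(accurate=True, complete=True, actionable=False),
    HintAnnotation(accurate=True, complete=False, actionable=True),    # flawed explanation
    HintAnnotation(accurate=False, complete=False, actionable=False),  # hallucinated issue
]
print(accurate_and_complete_rate(sample))  # 0.5
```

In the study, the same kind of aggregation over 45 annotated submissions yields the reported 63% rate; the actionability dimension would be aggregated analogously.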
📝 Abstract
Providing effective feedback is important for student learning in programming problem-solving. Large Language Models (LLMs) have emerged as potential tools to automate feedback generation, but their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed each model's capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight both the potential and the limitations of LLMs in programming education and underscore the need for improvements that enhance reliability and minimize risks in educational applications.