🤖 AI Summary
Existing Automated Programming Assessment Systems (APASs) rely heavily on predefined unit tests, yielding narrow, non-personalized feedback that inadequately supports learning. Method: This study comparatively evaluates three feedback modalities, namely compiler output, unit test results, and feedback generated by large language models (LLMs), and motivates a hybrid "unit test + AI-generated" feedback mechanism. A large-scale user study with over 200 students from two universities, covering both quantitative performance metrics and qualitative survey data, was conducted to assess efficacy. Contribution/Results: While students subjectively preferred unit-test feedback, the group receiving AI-generated feedback demonstrated significantly better problem-solving performance (p < 0.01). Combining the precision of unit tests with the explanatory depth of LLMs therefore promises to improve both feedback quality and learning outcomes. This work provides empirical evidence and concrete design guidance for building personalized, high-utility automated feedback systems in programming education.
📝 Abstract
With the rapid digitization of all major industries in recent years, the demand for programming skills, and with it the demand for introductory programming courses, has grown considerably. As a result, universities are integrating programming courses into a wide range of curricula, including not only technical studies but also business and management fields of study.
Consequently, additional resources are needed for teaching, grading, and tutoring students with diverse educational backgrounds and skill levels. In response, Automated Programming Assessment Systems (APASs) have emerged, providing scalable, high-quality assessment with efficient evaluation and instant feedback. APASs commonly rely heavily on predefined unit tests to generate feedback, which often limits the scope and level of detail of the feedback that can be provided to students. With the rise of Large Language Models (LLMs) in recent years, new opportunities have emerged to enhance feedback quality and personalization.
To investigate how different feedback mechanisms in APASs are perceived by students, and how effective they are in supporting problem-solving, we conducted a large-scale study with over 200 students from two universities. Specifically, we compare baseline Compiler Feedback, standard Unit Test Feedback, and advanced LLM-based Feedback with respect to perceived quality and impact on student performance.
Results indicate that while students rate unit test feedback as the most helpful, AI-generated feedback leads to significantly better performance. These findings suggest combining unit tests with AI-driven guidance to optimize automated feedback mechanisms and improve learning outcomes in programming education.
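To make the suggested combination more concrete, below is a minimal sketch of what a hybrid "unit test + LLM" feedback step in an APAS could look like. It is not taken from the paper: the structure, the `TestResult` type, and the `llm` callable are illustrative assumptions, with the LLM call left abstract so any model or API could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestResult:
    """Outcome of a single predefined unit test (hypothetical structure)."""
    name: str
    passed: bool
    message: str  # assertion or runtime error text for a failing test


def hybrid_feedback(
    student_code: str,
    test_results: List[TestResult],
    llm: Callable[[str], str],
) -> str:
    """Combine precise unit-test results with LLM-generated explanations.

    `llm` is any text-in/text-out completion function; in a real APAS it
    would wrap whichever model the system uses.
    """
    failures = [r for r in test_results if not r.passed]
    if not failures:
        return "All unit tests passed. Well done!"

    # The unit tests pinpoint *what* is wrong; the LLM explains *why*
    # and hints at a fix without handing over a full solution.
    failure_report = "\n".join(f"- {r.name}: {r.message}" for r in failures)
    prompt = (
        "You are a programming tutor. A student's submission failed these "
        "unit tests:\n"
        f"{failure_report}\n\n"
        "Student code:\n"
        f"{student_code}\n\n"
        "Explain the likely cause of each failure and give a hint, "
        "but do not write the corrected code."
    )
    explanation = llm(prompt)

    # Return both layers: the exact failing tests plus the explanatory feedback.
    return f"Failed tests:\n{failure_report}\n\nTutor feedback:\n{explanation}"
```

In such a design, the unit tests would remain the ground truth for correctness, with only the explanation delegated to the model, preserving the precision students valued in unit-test feedback while adding the kind of guidance associated with the performance gains reported above.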