🤖 AI Summary
This study addresses the challenge of automatically scoring students’ self-explanations of problem-solving steps in programming education, formulating it as a binary classification task. It presents the first systematic comparison between large language models (LLMs) and traditional semantic similarity approaches in this specific educational context, supported by a newly constructed high-quality, domain-specific, and class-balanced dataset to enable rigorous evaluation. The results demonstrate that LLMs significantly outperform baseline methods in accurately identifying correct self-explanations, while also revealing their practical limitations and boundary conditions. This work provides empirical evidence and methodological insights for the development of intelligent tutoring systems in educational technology that rely on automated feedback mechanisms.
📝 Abstract
Worked examples are step-by-step solutions to problems in a specific domain, offered to students to help them acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain each problem-solving step rather than passively study it. The main challenge of this approach is assessing the correctness of students' explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique for automatically scoring textual student responses such as essays or code explanations. Comparing these methods also requires high-quality datasets with balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.
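To make the semantic-similarity baseline concrete, the following is a minimal sketch of scoring a self-explanation as a binary decision against a reference explanation. It is an illustration under stated assumptions, not the method used in the paper: the bag-of-words cosine similarity stands in for whatever embedding model a real system would use, and the function names and the `threshold` value of 0.5 are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty explanation is similar to nothing
    return dot / (norm_a * norm_b)


def score_explanation(student: str, reference: str, threshold: float = 0.5) -> int:
    """Return 1 (judged correct) if similarity reaches the threshold, else 0.

    The threshold is an assumed value; in practice it would be tuned on labeled data.
    """
    similarity = cosine_similarity(bag_of_words(student), bag_of_words(reference))
    return 1 if similarity >= threshold else 0


if __name__ == "__main__":
    reference = "The loop adds each element of the list to the running total"
    print(score_explanation("the loop adds each element to the total", reference))
    print(score_explanation("it prints hello world", reference))
```

The sketch also shows why such a baseline is sensitive to wording: an explanation that is correct but phrased very differently from the reference can fall below the threshold, which is one motivation for comparing it against LLM-based scoring.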