🤖 AI Summary
This work investigates the capability of large language models (LLMs) to identify the first erroneous step in students’ mathematical solutions—a meta-reasoning task requiring error localization without re-solving the problem, solely by comparing the student’s original solution against a reference answer. While LLMs excel at mathematical problem-solving, their accuracy on this fine-grained diagnostic task remains notably low. To address this, we propose *Intermediate Correction Generation* (ICG): a method that prompts the LLM to generate intermediate solution versions that progressively correct the student’s reasoning while staying structurally and conceptually close to the original attempt. By aligning and contrasting the original and corrected traces, ICG sharpens first-error localization. Experiments on two benchmark datasets—VtG and PRM800K—demonstrate that ICG substantially outperforms direct classification baselines, improving first-error identification accuracy by up to 12.7%. This establishes a novel paradigm for LLM-based intelligent assessment and pedagogical feedback.
📝 Abstract
Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected solution aligned more closely with the student's original solution, which improves first-error localization performance.