LLMs cannot spot math errors, even when allowed to peek into the solution

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability of large language models (LLMs) to identify the first erroneous step in students' mathematical solutions—a meta-reasoning task that requires localizing the error without re-solving the problem, using only a comparison of the student's solution against a reference answer. While LLMs excel at mathematical problem-solving, their accuracy on this fine-grained diagnostic task remains notably low. To address this, we propose *Intermediate Correction Generation* (ICG): a method that prompts the LLM to generate intermediate solution versions that progressively correct the student's reasoning while staying structurally and conceptually close to the original attempt. By aligning and contrasting the original and corrected traces, ICG sharpens first-error localization. Experiments on two benchmark datasets—VtG and PRM800K—demonstrate that ICG substantially outperforms direct classification baselines, improving first-error identification accuracy by up to 12.7%. This establishes a novel paradigm for LLM-based intelligent assessment and pedagogical feedback.
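The comparison at the heart of ICG can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes solutions are lists of step strings, and the hypothetical `generate_intermediate_correction` merely stands in for the LLM prompt that would produce a corrected trace close to the student's wording.

```python
def generate_intermediate_correction(student_steps, reference_steps):
    """Hypothetical stand-in for the LLM call: prompt the model to produce a
    corrected solution that stays as close as possible to the student's
    structure. Here we simply return the reference steps as a placeholder."""
    return reference_steps

def first_error_step(student_steps, corrected_steps):
    """Return the 1-based index of the first step where the student's
    solution diverges from the corrected trace, or None if they match."""
    for i, (s, c) in enumerate(zip(student_steps, corrected_steps), start=1):
        if s.strip() != c.strip():
            return i
    return None

# Toy example: the student slips at step 2.
student = ["2x + 3 = 7", "2x = 10", "x = 5"]
corrected = ["2x + 3 = 7", "2x = 4", "x = 2"]
print(first_error_step(student, corrected))  # prints 2
```

Because the corrected trace mirrors the student's own steps, a step-by-step comparison pins down the first divergence, which is harder when contrasting against a reference solution written in a different style.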

📝 Abstract
Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected solution that aligns more closely with the student's original solution, which helps improve performance.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle to locate the first error in math solutions
Models fail even with access to reference solutions
Proposing corrected intermediate solutions to improve accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates intermediate corrected student solution
Aligns closely with original student solution
Improves error-localization performance