🤖 AI Summary
This work investigates the capability of large language models (LLMs) to identify the first erroneous step in students’ mathematical solutions—a meta-reasoning task requiring error localization without re-solving the problem, solely by comparing the student’s original solution against a reference answer. While LLMs excel at mathematical problem-solving, their accuracy on this fine-grained diagnostic task remains notably low. To address this, we propose *Intermediate Correction Generation* (ICG): a method that prompts the LLM to generate intermediate solution versions that progressively correct the student’s reasoning while staying structurally and conceptually close to the original attempt. By aligning and contrasting the original and corrected traces, ICG sharpens first-error localization. Experiments on two benchmark datasets—VtG and PRM800K—demonstrate that ICG substantially outperforms direct classification baselines, improving first-error identification accuracy by up to 12.7%. This establishes a novel paradigm for LLM-based intelligent assessment and pedagogical feedback.
📝 Abstract
Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected solution aligned more closely with the student's original solution, which improves first-error localization performance.