Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability of large language models (LLMs) and process reward models (PRMs) to detect errors in long chain-of-thought (CoT) reasoning. To this end, we introduce DeltaBench—the first fine-grained benchmark specifically designed for long-CoT error detection—covering mathematical, coding, and general reasoning tasks. DeltaBench collects long CoTs generated by multiple o1-like models (e.g., QwQ, DeepSeek-R1) and employs human annotation to localize errors at the step level, enabling process-aware evaluation. Experimental results reveal that current PRMs achieve less than 35% detection accuracy for early-stage errors and exhibit over 60% false positive rates for later steps. Moreover, significant disparities exist across o1-like models in reasoning length, error distribution, and error detectability. DeltaBench is publicly released to establish a new standard for evaluating trustworthiness in long-chain reasoning.
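The step-level evaluation the summary describes can be sketched as follows. This is a hypothetical illustration, not the authors' actual evaluation code: annotated error steps and detector-flagged steps are compared to compute step-level recall on true errors and the false-positive rate on correct steps (the two quantities the reported 35% / 60% figures refer to).

```python
# Hypothetical sketch of step-level error-detection scoring, in the spirit of
# DeltaBench's setup (not the paper's actual evaluation code).
# A long CoT is a sequence of steps; human annotation marks which steps are
# erroneous, and a detector (e.g., a PRM) flags steps it believes are wrong.

def score_detection(annotated_errors, flagged_steps, num_steps):
    """Return (recall on true error steps, false-positive rate on correct steps).

    annotated_errors, flagged_steps: sets of 0-based step indices.
    """
    true_errors = set(annotated_errors)
    flagged = set(flagged_steps)
    correct_steps = set(range(num_steps)) - true_errors

    recall = len(flagged & true_errors) / len(true_errors) if true_errors else 1.0
    fpr = len(flagged & correct_steps) / len(correct_steps) if correct_steps else 0.0
    return recall, fpr

# Toy example: a 6-step CoT with errors at steps 2 and 4;
# the detector flags steps 4 and 5.
recall, fpr = score_detection({2, 4}, {4, 5}, 6)
print(recall, fpr)  # 0.5 (missed step 2), 0.25 (1 of 4 correct steps flagged)
```

Aggregating these per-CoT scores over a benchmark, and bucketing by where the first error occurs, gives the early-vs-late-stage breakdown described above.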

📝 Abstract
Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) on different reasoning tasks (e.g., math, code, general reasoning), to measure the ability to detect errors in long-CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. We then conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting the errors in each annotated process, aiming to investigate the boundaries and limitations of these models. Finally, we hope that DeltaBench can guide developers to better understand the long-CoT reasoning abilities of their models.
Problem

Research questions and friction points this paper is trying to address.

Evaluate error detection in long Chain-of-Thought reasoning
Assess performance of process reward models
Analyze effectiveness of o1-like models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces DeltaBench for error detection
Analyzes long Chain-of-Thought reasoning effectiveness
Evaluates process reward and critic models