🤖 AI Summary
This work investigates how well large language models (LLMs) and process reward models (PRMs) can detect errors in long chain-of-thought (CoT) reasoning. To this end, we introduce DeltaBench, the first fine-grained benchmark specifically designed for long-CoT error detection, covering mathematical, coding, and general reasoning tasks. DeltaBench collects multi-source long CoTs generated by o1-like models (e.g., QwQ, DeepSeek-R1) and employs human annotation to localize errors at the step level, enabling process-aware evaluation. Experimental results reveal that current PRMs achieve less than 35% detection accuracy for early-stage errors and exhibit false positive rates above 60% for later steps. Moreover, o1-like models differ significantly in reasoning length, error distribution, and error detectability. DeltaBench is publicly released to establish a new standard for evaluating the trustworthiness of long-chain reasoning.
📝 Abstract
Recently, o1-like models have drawn significant attention for producing long Chain-of-Thought (CoT) reasoning steps that improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and to measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) on different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long-CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. We then conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting errors in each annotated process, in order to investigate the boundaries and limitations of these PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long-CoT reasoning abilities of their models.