Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward modeling (PRM) approaches oversimplify error attribution in chain-of-thought (CoT) reasoning by assuming full-chain failure after the first erroneous step, neglecting large language models’ inherent self-correction and reflective capabilities. Method: We propose a fine-grained PRM framework introducing two novel concepts—*error propagation* and *error cessation*—enabling precise discrimination of interleaved correct and incorrect reasoning steps. Leveraging an LLM-based automated annotation pipeline, we construct 1.7M high-quality process-level samples to train a 7B-parameter PRM supporting both solution-level and step-level evaluation. Results: Our method consistently outperforms open-source PRMs across search guidance, Best-of-N (BoN), and F1 metrics. Compared to Monte Carlo (MC) sampling-based annotation, it achieves higher data efficiency and stronger performance, while demonstrating robustness and strong cross-task generalization.
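The key distinction in the summary is between steps that keep building on an earlier mistake (error propagation) and steps where the chain recovers through self-correction (error cessation). The Python sketch below shows one way such fine-grained step labels could be derived from per-step correctness judgments; the label names and the `derive_labels` rule are illustrative assumptions, not the paper's exact definitions.

```python
from enum import Enum
from typing import List


class StepLabel(Enum):
    """Fine-grained step labels (illustrative names, not the paper's taxonomy)."""
    CORRECT = "correct"                       # valid step built on valid premises
    FIRST_ERROR = "first_error"               # step where a mistake is introduced
    ERROR_PROPAGATION = "error_propagation"   # step that keeps reasoning on an uncorrected error
    ERROR_CESSATION = "error_cessation"       # step that self-corrects and stops the error


def derive_labels(step_correct: List[bool]) -> List[StepLabel]:
    """Map per-step correctness judgments (e.g. from an LLM judge) to labels.

    Unlike first-error-only schemes, correct steps that appear *after* an
    error are kept and marked as cessation rather than discarded."""
    labels: List[StepLabel] = []
    in_error = False  # are we currently downstream of an uncorrected mistake?
    for ok in step_correct:
        if ok and not in_error:
            labels.append(StepLabel.CORRECT)
        elif not ok and not in_error:
            labels.append(StepLabel.FIRST_ERROR)
            in_error = True
        elif not ok and in_error:
            labels.append(StepLabel.ERROR_PROPAGATION)
        else:  # correct step while in an error state: the chain recovers
            labels.append(StepLabel.ERROR_CESSATION)
            in_error = False
    return labels


if __name__ == "__main__":
    # A reflective chain: mistake at step 2, recovery at step 4.
    print([l.value for l in derive_labels([True, False, False, True, True])])
    # -> ['correct', 'first_error', 'error_propagation', 'error_cessation', 'correct']
```

Under a first-error-only scheme, every step after step 2 in this example would be marked incorrect; the cessation label is what allows a PRM to credit the recovery.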

📝 Abstract
Many studies focus on data annotation techniques for training effective PRMs. However, current methods encounter a significant issue when applied to long CoT reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that under the reflection pattern, correct and incorrect steps often alternate, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Leveraging an LLM-based judger for annotation, we collect 1.7 million data samples to train a 7B PRM and evaluate it at both solution and step levels. Experimental results demonstrate that compared to existing open-source PRMs and PRMs trained on open-source datasets, our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores. Compared to widely used MC-based annotation methods, our annotation approach not only achieves higher data efficiency but also delivers superior performance. Detailed analysis is also conducted to demonstrate the stability and generalizability of our method.
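Best-of-N (BoN), one of the evaluation settings mentioned in the abstract, ranks N sampled solutions by a score derived from the PRM's step-level outputs. A minimal sketch follows, assuming a hypothetical `prm_step_scores` callable that returns one score in [0, 1] per step; the aggregation rules shown (min, product, mean) are common choices in the PRM literature, not necessarily the ones used in this paper.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[List[str]],
    prm_step_scores: Callable[[List[str]], List[float]],
    aggregate: str = "min",
) -> int:
    """Return the index of the candidate solution preferred by the PRM.

    Each candidate is a list of reasoning steps; `prm_step_scores` is a
    placeholder for the trained PRM's step-level scoring interface."""
    def solution_score(steps: List[str]) -> float:
        scores = prm_step_scores(steps)
        if aggregate == "min":       # a single weak step sinks the solution
            return min(scores)
        if aggregate == "prod":      # product of step scores
            p = 1.0
            for s in scores:
                p *= s
            return p
        return sum(scores) / len(scores)  # mean of step scores

    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))
```

Min aggregation penalizes any single weak step, while the mean is more forgiving of isolated mistakes.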
Problem

Research questions and friction points this paper is trying to address.

Focuses on improving data annotation for Process Reward Models (PRMs)
Addresses bias in scoring long Chain-of-Thought (CoT) reasoning steps
Enhances PRMs' ability to detect self-correction in reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel data annotation for long CoT reasoning
Introduces Error Propagation and Cessation concepts
Uses LLM-based judger for efficient annotation (see the annotation sketch after this list)
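The LLM-based judger bullet above suggests an annotation loop in which a judge model evaluates each step in the context of the steps before it. The sketch below outlines such a loop; `call_judge` is a placeholder for whatever judge backend is used and `JUDGE_PROMPT` is an assumed template, not the paper's actual prompt. The resulting per-step flags could then feed a labeling rule like the `derive_labels` sketch shown earlier.

```python
from typing import Callable, List

# Assumed prompt template for the step judge; the paper's actual prompt is not shown here.
JUDGE_PROMPT = (
    "Problem:\n{problem}\n\n"
    "Steps so far:\n{context}\n\n"
    "Current step:\n{step}\n\n"
    "Is the current step logically and mathematically correct given the steps so far? "
    "Answer with exactly 'correct' or 'incorrect'."
)


def annotate_chain(
    problem: str,
    steps: List[str],
    call_judge: Callable[[str], str],  # placeholder for the LLM judge backend
) -> List[bool]:
    """Query the judge model about each step in context and collect
    per-step correctness flags for downstream labeling."""
    verdicts: List[bool] = []
    for i, step in enumerate(steps):
        prompt = JUDGE_PROMPT.format(
            problem=problem,
            context="\n".join(steps[:i]) or "(none)",
            step=step,
        )
        reply = call_judge(prompt).strip().lower()
        verdicts.append(reply.startswith("correct"))
    return verdicts
```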
👥 Authors
Zhaohui Yang
Institute of Automation, Chinese Academy of Sciences
Chenghua He
Institute of Automation, Chinese Academy of Sciences
Xiaowen Shi
Meituan
Linjing Li
Institute of Automation, Chinese Academy of Sciences
Qiyue Yin
Institute of Automation, Chinese Academy of Sciences
Shihong Deng
Bytedance Technology (Artificial Intelligence)
Daxin Jiang
Co-Founder & CEO, StepFun Corporation (Deep Learning, Foundation Models)