🤖 AI Summary
This work addresses a critical issue in existing process reward models based on Monte Carlo estimation (MCE): policy-dependent labeling introduces noise that leads to incorrect rewards for reasoning steps. The study is the first to identify this problem and proposes a two-stage denoising framework to mitigate it. First, a large language model (LLM) is used as a judge to detect reflection and self-correction behaviors and rectify noisy labels. Second, a noise-aware iterative training mechanism dynamically refines these labels based on model confidence. By combining LLM-based adjudication with iterative denoising, the approach substantially outperforms baseline methods on step-level correctness evaluation, achieving an absolute F1 improvement of up to 27% and markedly enhancing the robustness of process reward modeling.
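To make the MCE labeling concrete, the sketch below estimates a step-level reward by rolling out completions from a reasoning prefix. This is a minimal illustration under assumed interfaces, not the paper's implementation; `policy_complete`, `is_correct`, and `n_rollouts` are hypothetical names.

```python
# Minimal sketch of Monte Carlo Estimation (MCE) labeling, assuming a
# sampling policy and an answer checker. All names here are illustrative.

def mce_step_reward(prefix_steps, policy_complete, is_correct, n_rollouts=8):
    """Estimate a step's reward as the fraction of rollouts that continue
    from the prefix and reach the correct final answer."""
    hits = 0
    for _ in range(n_rollouts):
        completion = policy_complete(prefix_steps)  # sample one continuation
        if is_correct(completion):
            hits += 1
    # The estimate depends on the sampling policy, which is exactly the
    # source of the policy-dependent label noise described here.
    return hits / n_rollouts
```

A weak policy can fail from a correct step (a false negative), while a strong policy can recover from an incorrect step (a false positive), which is why such labels are policy-dependent.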
📝 Abstract
Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory and should be invariant to the choice of policy. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training (NAIT) framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
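As a rough picture of the confidence-based refinement in the training stage, the sketch below flips a noisy step label only when the current PRM disagrees with it at high confidence. The flipping rule and `flip_threshold` are assumptions made for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of one round of confidence-based label refinement,
# in the spirit of Noise-Aware Iterative Training. Thresholds are assumed.

def refine_labels(prm_probs, labels, flip_threshold=0.9):
    """Flip a noisy binary step label when the PRM's predicted probability
    of step correctness contradicts it with high confidence."""
    refined = []
    for p, y in zip(prm_probs, labels):
        model_label = 1 if p >= 0.5 else 0
        confidence = max(p, 1.0 - p)
        if model_label != y and confidence >= flip_threshold:
            refined.append(model_label)  # trust the confident model
        else:
            refined.append(y)            # keep the original label
    return refined
```

Alternating such a refinement step with retraining would let the PRM progressively clean its own supervision, matching the paper's description of refining noisy labels based on the model's own confidence.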