🤖 AI Summary
This work addresses a critical issue in existing process reward models based on Monte Carlo estimation (MCE): policy-dependent labeling introduces noise that leads to incorrect rewards for reasoning steps. The study is the first to identify this problem and proposes a two-stage denoising framework to mitigate it. First, a large language model (LLM) is used as a judge to detect reflection and self-correction behaviors and rectify noisy labels. Second, a noise-aware iterative training mechanism dynamically refines these labels based on model confidence. By combining LLM-based adjudication with iterative denoising, the approach substantially outperforms baseline methods on step-level correctness evaluation, achieving an absolute F1 improvement of up to 27% and markedly enhancing the robustness of process reward modeling.
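To make the MCE labeling concrete, the sketch below estimates a step-level reward by rolling out completions from a reasoning prefix. This is a minimal illustration under assumed interfaces, not the paper's implementation; `policy_complete`, `is_correct`, and `n_rollouts` are hypothetical names.

```python
# Minimal sketch of Monte Carlo Estimation (MCE) labeling, assuming a
# sampling policy and an answer checker. All names here are illustrative.

def mce_step_reward(prefix_steps, policy_complete, is_correct, n_rollouts=8):
    """Estimate a step's reward as the fraction of rollouts that continue
    from the prefix and reach the correct final answer."""
    hits = 0
    for _ in range(n_rollouts):
        completion = policy_complete(prefix_steps)  # sample one continuation
        if is_correct(completion):
            hits += 1
    # The estimate depends on the sampling policy, which is exactly the
    # source of the policy-dependent label noise described here.
    return hits / n_rollouts
```

A weak policy can fail from a correct step (a false negative), while a strong policy can recover from an incorrect step (a false positive), which is why such labels are policy-dependent.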
📝 Abstract
Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory and should be invariant to the choice of policy. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training (NAIT) framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
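As a rough picture of the confidence-based refinement in the training stage, the sketch below flips a noisy step label only when the current PRM disagrees with it at high confidence. The flipping rule and `flip_threshold` are assumptions made for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of one round of confidence-based label refinement,
# in the spirit of Noise-Aware Iterative Training. Thresholds are assumed.

def refine_labels(prm_probs, labels, flip_threshold=0.9):
    """Flip a noisy binary step label when the PRM's predicted probability
    of step correctness contradicts it with high confidence."""
    refined = []
    for p, y in zip(prm_probs, labels):
        model_label = 1 if p >= 0.5 else 0
        confidence = max(p, 1.0 - p)
        if model_label != y and confidence >= flip_threshold:
            refined.append(model_label)  # trust the confident model
        else:
            refined.append(y)            # keep the original label
    return refined
```

Alternating such a refinement step with retraining would let the PRM progressively clean its own supervision, matching the paper's description of refining noisy labels based on the model's own confidence.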