Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the train-inference mismatch in implicit process reward models: training constrains only a sequence-level aggregate, so the token-level reward signals used at inference are unreliable. To resolve this, the authors propose the Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function that estimates the probability of eventually reaching a correct answer, and derives step-level signals from temporal-difference (TD) differences. Building on these prefix values, they introduce Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. On ProcessBench, IPVRM substantially improves step-verification F1, and DistRL consistently improves downstream reasoning once paired with IPVRM.
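To make the idea of a TD-derived step signal concrete, here is a minimal sketch, not the authors' implementation. It assumes that a prefix-value model has already produced one estimate of eventual correctness per reasoning prefix, that no discounting or intermediate reward is used, and that the step signal is simply the change in value between consecutive prefixes. The function name `td_step_signals`, the example values, and the `-0.2` threshold are all hypothetical.

```python
from typing import List, Optional


def td_step_signals(prefix_values: List[float]) -> List[float]:
    """Return one signal per step as the change in estimated correctness.

    Assumption: prefix_values[t] approximates P(final answer correct | first t steps),
    so a list of n + 1 values describes n steps. No discounting is applied.
    """
    if len(prefix_values) < 2:
        return []
    return [nxt - cur for cur, nxt in zip(prefix_values, prefix_values[1:])]


# Hypothetical prefix values for a four-step solution (illustrative numbers only).
values = [0.50, 0.62, 0.60, 0.15, 0.10]
signals = td_step_signals(values)

# Flag the first step whose value drop exceeds a hypothetical threshold.
first_bad_step: Optional[int] = next(
    (i for i, s in enumerate(signals) if s < -0.2), None
)
```

Under these assumptions the signals telescope: their sum equals the last prefix value minus the first, so a large negative signal localizes where the estimated chance of a correct answer collapsed.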

📝 Abstract

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
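The distribution-level idea described above, scoring both the sampled token and high-probability candidates without extra rollouts, can be illustrated with a small sketch. This is one plausible reading under stated assumptions, not the paper's algorithm: the abstract does not specify how candidates are selected or how advantages enter the loss. Here candidates are the sampled token plus the top-k tokens by policy probability, and each advantage is the prefix value after appending the token minus the value of the current prefix. The names `candidate_advantages`, `value_fn`, `top_k`, and the toy value table are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def candidate_advantages(
    prefix: Sequence[str],
    token_probs: Dict[str, float],
    sampled_token: str,
    value_fn: Callable[[Sequence[str]], float],
    top_k: int = 2,
) -> Dict[str, float]:
    """TD-style advantage for the sampled token and the top-k candidate tokens.

    Assumption: value_fn(prefix) estimates the probability of eventual correctness,
    so one value query per candidate replaces a full rollout.
    """
    top = sorted(token_probs, key=token_probs.get, reverse=True)[:top_k]
    # Keep the sampled token first and drop duplicates while preserving order.
    candidates: List[str] = list(dict.fromkeys([sampled_token, *top]))
    base = value_fn(prefix)
    return {tok: value_fn([*prefix, tok]) - base for tok in candidates}


# Toy stand-in for a learned prefix-value model (illustrative numbers only).
toy_values = {
    ("2+2=",): 0.5,
    ("2+2=", "4"): 0.9,
    ("2+2=", "5"): 0.1,
    ("2+2=", "3"): 0.1,
}


def toy_value_fn(prefix: Sequence[str]) -> float:
    return toy_values[tuple(prefix)]


probs = {"4": 0.5, "5": 0.3, "3": 0.2}
adv = candidate_advantages(["2+2="], probs, sampled_token="3", value_fn=toy_value_fn)
```

In this toy case the sampled token "3" and the candidate "5" receive negative advantages while "4" receives a positive one, so a single position yields several counterfactual training signals. The sketch also shows why calibration matters: if the value estimates were noisy, every candidate update would inherit that noise.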

Problem

Research questions and friction points this paper is trying to address.

implicit reward models

train-inference mismatch

token-level credit assignment

reasoning step quality

process reward models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Reward Learning

Prefix-Value Function

Temporal-Difference Learning