Process Rewards with Learned Reliability

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing Process Reward Models (PRMs): they output only a point-wise reward for each step and give no measure of how reliable that prediction is, so downstream methods may act on untrustworthy signals. The authors propose BetaPRM, which models the reliability of process rewards explicitly by learning, for each reasoning step, a Beta belief over the success probability; the belief is fit to Monte Carlo rollout outcomes through a Beta-Binomial likelihood. They also introduce Adaptive Computation Allocation (ACA), a strategy for PRM-guided Best-of-N reasoning that stops early when a high-reward solution is reliable and spends additional computation on uncertain candidate prefixes. Experiments across four backbone models and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. ACA reduces token usage by up to 33.57% relative to a fixed-budget Best-of-16 baseline while improving final-answer accuracy.

📝 Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
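To make the Beta-Binomial objective concrete, here is a minimal sketch of the standard Beta-Binomial negative log-likelihood for one step, where k of n Monte Carlo continuations succeeded. The (alpha, beta) parameterization, the function names, and the use of the Beta variance as a reliability proxy are illustrative assumptions; the abstract does not say how BetaPRM parameterizes its belief or how it defines its reliability signal.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n rollouts under a
    Beta(alpha, beta) belief over the step-success probability.

    pmf(k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    """
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    log_lik = (
        log_choose
        + log_beta(k + alpha, n - k + beta)
        - log_beta(alpha, beta)
    )
    return -log_lik


def beta_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and variance of Beta(alpha, beta).

    Assumption: the mean plays the role of the step reward, and a small
    variance is read as a more reliable prediction.
    """
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var


# Two hypothetical beliefs with the same mean (0.75) but different spread.
loose_mean, loose_var = beta_summary(3.0, 1.0)
tight_mean, tight_var = beta_summary(30.0, 10.0)

# Loss for a step where 6 of 8 continuations succeeded.
loss = beta_binomial_nll(k=6, n=8, alpha=30.0, beta=10.0)
```

Both beliefs above yield the same point reward, yet the second is much more concentrated. A model that regresses only to the ratio k/n cannot express that difference, which is the gap the abstract describes.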

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

step-level feedback

reward reliability

distributional prediction

reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

BetaPRM

distributional reward modeling

reliability estimation