Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses poorly calibrated process reward models (PRMs), which inference-time scaling methods rely on and which often overestimate success probabilities. The authors propose what is, to their knowledge, the first use of conditional optimal transport (CondOT) for PRM calibration: CondOT map learning is modified to estimate a monotonic conditional quantile function over PRM-estimated success probabilities, conditioned on PRM hidden states. The resulting quantile estimates are structurally valid and allow confidence bounds to be extracted at arbitrary levels, which are then integrated into the instance-adaptive scaling (IAS) framework. On the MATH-500 and AIME mathematical reasoning benchmarks, the method substantially improves calibration over both uncalibrated PRMs and quantile regression for PRMs with reliable ranking signals, and it generally improves downstream Best-of-N IAS performance over uncalibrated PRMs.
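To make "overestimating success probabilities" concrete, the following is a minimal sketch of one common calibration diagnostic, a binned expected calibration error, run on synthetic data. It is not the paper's evaluation protocol: the function name `expected_calibration_error`, the bin count, and the synthetic overconfident predictor are all assumptions chosen for illustration.

```python
import numpy as np

def expected_calibration_error(pred, outcome, n_bins=10):
    """Weighted mean gap between predicted probability and empirical success rate per bin."""
    pred = np.asarray(pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(pred[mask].mean() - outcome[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# Synthetic stand-in for PRM scores (assumption): outcomes are drawn with
# probability true_p, while the "overconfident" scorer reports sqrt(true_p),
# which is never smaller than true_p on [0, 1].
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=20000)
outcome = rng.random(20000) < true_p
overconfident = np.sqrt(true_p)

ece_calibrated = expected_calibration_error(true_p, outcome)
ece_overconfident = expected_calibration_error(overconfident, outcome)
```

In this toy setup the overconfident scorer reports a higher average probability than the observed success rate, which is the kind of gap a recalibration step is meant to close.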

📝 Abstract

Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
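The structural property described above, a quantile function that is monotone in the confidence level, can be illustrated with a small sketch. This is not the paper's CondOT procedure. It assumes a hypothetical model head that emits unconstrained scores `raw` (random numbers here, standing in for outputs conditioned on PRM hidden states) and turns them into non-decreasing quantile values through positive increments, so the estimates cannot cross. The names `monotone_quantiles` and `quantile_at` are illustrative.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); output is non-negative.
    return np.logaddexp(0.0, x)

def monotone_quantiles(raw):
    """Map K+1 unconstrained scores to K non-decreasing quantile values in [0, 1]."""
    increments = softplus(np.asarray(raw, dtype=float))
    cum = np.cumsum(increments)
    # Dividing the first K partial sums by the total keeps them ordered and inside [0, 1].
    return cum[:-1] / cum[-1]

def quantile_at(raw, tau):
    """Read off the quantile at an arbitrary level tau by linear interpolation."""
    values = monotone_quantiles(raw)
    # Evenly spaced interior levels 1/(K+1), ..., K/(K+1) (an assumption of this sketch).
    levels = np.arange(1, values.size + 1) / (values.size + 1)
    return float(np.interp(tau, levels, values))

rng = np.random.default_rng(0)
raw = rng.normal(size=20)          # hypothetical head output for one reasoning prefix
values = monotone_quantiles(raw)   # 19 ordered quantile values
lower_10 = quantile_at(raw, 0.10)  # lower confidence bound on success probability
median = quantile_at(raw, 0.50)
upper_90 = quantile_at(raw, 0.90)
```

Because the values are ordered by construction, a bound at any level can be read from a single forward pass, whereas independently trained quantile regression heads carry no such ordering guarantee and can produce crossing estimates.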

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

calibration

overestimation

success probability

inference-time scaling

Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional optimal transport

Process Reward Models

quantile calibration