🤖 AI Summary
Existing reward models (RMs) struggle to capture both aleatoric uncertainty—arising from inherent stochasticity in human preferences—and epistemic uncertainty—stemming from the model's limited knowledge and capacity—and lack mechanisms to quantify the reliability of their reward predictions. This work introduces the first unified two-level framework for jointly modeling both uncertainties, proposing a single-model variant (URM) and an ensemble-based variant (URME). URM employs a probabilistic value head and disentangled preference-attribute modeling; URME further leverages inter-model discrepancy analysis for fine-grained uncertainty estimation and automatic identification of unreliable reward predictions. Evaluated on RewardBench, the proposed methods significantly outperform state-of-the-art RMs. When integrated with Best-of-N sampling and iterative DPO/PPO optimization, they substantially enhance LLM generation quality, robustness to distributional shifts (e.g., across domains or prompts), and decision reliability—low-uncertainty predictions consistently correlate with higher output quality and stronger generalization.
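The probabilistic value head at the core of URM can be illustrated with a minimal sketch. Instead of emitting a point-estimate reward, the head predicts a Gaussian (mean, log-variance) for each disentangled preference attribute, and the predicted variance serves as a measure of aleatoric uncertainty. All names, the linear parameterization, and the averaging scheme below are illustrative assumptions, not the paper's implementation:

```python
import math
import random

random.seed(0)

class ProbabilisticValueHead:
    """Toy stand-in for URM's probabilistic value head (hypothetical names).
    For each preference attribute (e.g. helpfulness, safety) it predicts a
    Gaussian (mean, log-variance) rather than a point estimate; the predicted
    variance captures aleatoric uncertainty in human preferences."""

    def __init__(self, hidden_dim, attributes):
        self.attributes = attributes
        # Random linear weights stand in for learned parameters.
        self.w_mean = [[random.gauss(0, 0.1) for _ in range(hidden_dim)]
                       for _ in attributes]
        self.w_logvar = [[random.gauss(0, 0.1) for _ in range(hidden_dim)]
                         for _ in attributes]

    def forward(self, h):
        # One (mean, log-variance) pair per disentangled attribute.
        means = [sum(w * x for w, x in zip(row, h)) for row in self.w_mean]
        logvars = [sum(w * x for w, x in zip(row, h)) for row in self.w_logvar]
        return means, logvars

    def reward(self, h):
        """Scalar reward = average of attribute means; aleatoric uncertainty =
        average predicted variance across attributes (an assumed aggregation)."""
        means, logvars = self.forward(h)
        r = sum(means) / len(means)
        aleatoric = sum(math.exp(lv) for lv in logvars) / len(logvars)
        return r, aleatoric

head = ProbabilisticValueHead(hidden_dim=4, attributes=["helpfulness", "safety"])
r, u = head.reward([0.5, -0.2, 0.1, 0.9])
print(f"reward={r:.3f}, aleatoric uncertainty={u:.3f}")
```

In practice the hidden vector `h` would come from the LLM backbone and the head would be trained on preference data; the point of the sketch is that uncertainty is a first-class output alongside the reward.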
📝 Abstract
Reward models (RMs) are essential for aligning large language models (LLMs) with human expectations. However, existing RMs struggle to capture the stochastic and uncertain nature of human preferences and fail to assess the reliability of reward predictions. To address these challenges, we introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME. URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes. URME further quantifies epistemic uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations. Our empirical evaluations demonstrate that URM achieves strong performance on RewardBench, outperforming competitive large-scale models. Additionally, extensive experiments, including best-of-N sampling (BoN), iterative direct preference optimization (iterative DPO), and proximal policy optimization (PPO), demonstrate that URM and URME significantly enhance LLMs' generation quality. Notably, reward predictions with lower uncertainty are far more reliable, demonstrate significantly higher quality, and result in substantially improved alignment.
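The ensemble side of the method can be sketched similarly: URME estimates epistemic uncertainty from disagreement among ensemble members, and that signal can gate Best-of-N selection. The variance-based disagreement measure and the thresholding rule below are illustrative assumptions, not the paper's exact formulation:

```python
import statistics

def epistemic_uncertainty(rewards):
    """Disagreement among ensemble members' scalar rewards for one response.
    Population variance across members is used here as a simple proxy; the
    paper's URME examines inter-model discrepancies, and the exact measure
    in this sketch is an assumption."""
    return statistics.pvariance(rewards)

def uncertainty_aware_best_of_n(candidates, threshold):
    """candidates: list of (response, [reward from each ensemble member]).
    Keep only responses the ensemble agrees on (low epistemic uncertainty),
    then pick the highest mean reward among them — a hedged sketch of
    combining URME with Best-of-N sampling."""
    reliable = [(resp, statistics.mean(rs))
                for resp, rs in candidates
                if epistemic_uncertainty(rs) <= threshold]
    if not reliable:
        return None  # every candidate was flagged as unreliable
    return max(reliable, key=lambda pair: pair[1])[0]

candidates = [
    ("A", [0.9, 0.2, 1.5]),   # higher mean reward, but members disagree strongly
    ("B", [0.7, 0.65, 0.75]), # slightly lower mean, confident prediction
]
print(uncertainty_aware_best_of_n(candidates, threshold=0.05))  # → B
```

Filtering out high-disagreement candidates before picking the argmax is what makes the selection robust: response A's inflated mean comes from a prediction the ensemble does not trust, matching the paper's observation that low-uncertainty predictions are the reliable ones.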