🤖 AI Summary
This work identifies a high-variance problem in reward model (RM) training for large language model (LLM) alignment: RMs trained independently on the same preference dataset can disagree substantially in their ranking predictions, so policy optimization that relies on a single RM is error-prone and can degrade performance. To address this, we propose Variance-Aware Policy Optimization (VAP), the first framework to incorporate RM prediction variance as an uncertainty measure directly into the reinforcement learning objective, via a variance-regularized term that constrains policy updates. We theoretically prove that VAP reduces the risk of generating low-quality responses. Empirically, across multiple LLMs and RM configurations, VAP significantly improves alignment stability and robustness compared to standard RLHF pipelines, with consistent gains in both offline evaluation and online deployment settings.
📝 Abstract
Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing a policy against a single reward model estimate leaves it vulnerable to inaccuracies in that estimate. We empirically study the variability of reward model training on open-source benchmarks. We observe that reward models trained independently on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new policy regularizer that incorporates reward model variance estimates. We show that variance-aware policy optimization provably reduces the risk of outputting a worse policy than the default. Experiments across diverse LLM and reward model configurations confirm that our approach yields more stable and robust alignment than the standard (variance-unaware) pipeline.
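The abstract does not spell out the exact form of the variance-aware regularizer, but its key ingredient (discounting rewards for responses where independently trained reward models disagree) can be illustrated with a minimal sketch. Here the function name `variance_penalized_reward` and the penalty weight `lam` are hypothetical, assuming a simple mean-minus-standard-deviation combination over an RM ensemble:

```python
import numpy as np

def variance_penalized_reward(rm_scores, lam=1.0):
    """Combine scores from an ensemble of independently trained reward
    models into a single variance-penalized reward.

    rm_scores: shape (n_models, n_responses) -- score each reward model
    assigns to each candidate response.
    lam: hypothetical penalty weight on ensemble disagreement.
    Returns, per response: mean score minus lam * std of scores.
    """
    rm_scores = np.asarray(rm_scores, dtype=float)
    mean = rm_scores.mean(axis=0)   # ensemble point estimate of the reward
    std = rm_scores.std(axis=0)     # disagreement = uncertainty proxy
    return mean - lam * std

# Two candidate responses with the same mean reward (1.0):
# the ensemble agrees on the first but disagrees on the second.
scores = [[1.0, 2.0],
          [1.0, 0.0],
          [1.0, 1.0]]
print(variance_penalized_reward(scores, lam=1.0))
# The uncertain response is penalized, steering the policy away from
# regions where a single reward model's estimate cannot be trusted.
```

Feeding this penalized reward (instead of a single RM's raw score) into a standard RLHF optimizer is one way to realize the variance-aware pipeline the abstract describes; the paper's actual regularizer may differ.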