🤖 AI Summary
This paper identifies a fundamental flaw in reward modeling for Reinforcement Learning from Human Feedback (RLHF): reward model (RM) accuracy alone is insufficient to guarantee efficient policy optimization. Method: Taking an optimization perspective, the authors introduce reward variance as an independent, critical factor. Low reward variance flattens the objective landscape and causes vanishing gradients, severely impeding convergence; as a result, a highly accurate RM may underperform a less accurate one that induces higher variance. They further establish a notion of variance compatibility between RMs and language models (LMs), showing that the same RM can induce markedly different reward variances across LMs. Contribution/Results: The work proves that an RM inducing low reward variance yields a flat objective landscape and slow reward maximization, regardless of its accuracy. Experiments with LMs of up to 8B parameters corroborate the theory, demonstrating the interplay among RM accuracy, reward variance, and optimization speed; moderate-accuracy, high-variance RMs can substantially accelerate training, challenging the prevailing accuracy-centric evaluation paradigm.
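The vanishing-gradient intuition can be made concrete with a toy sketch (my own illustration, not the paper's setup): for a policy `pi = softmax(theta)` over a few candidate responses, the gradient of the expected reward with respect to each logit is `pi_i * (r_i - E_pi[r])`, so its magnitude is governed by how much the rewards vary under the policy rather than by how correctly they rank the responses. A perfectly accurate but low-variance reward vector therefore produces a nearly zero gradient, while a less accurate, high-variance one does not.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over logits.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward_grad(theta, r):
    # Gradient of E_{y ~ softmax(theta)}[r(y)] w.r.t. theta:
    # d/dtheta_i = pi_i * (r_i - E_pi[r]).
    pi = softmax(theta)
    mu = pi @ r
    return pi * (r - mu)

theta = np.zeros(5)  # uniform initial policy over 5 responses

# Perfectly accurate but low-variance RM: correct ranking, tiny spread.
r_low_var = np.array([0.500, 0.501, 0.502, 0.503, 0.504])
# Less accurate but high-variance RM: two items mis-ranked, large spread.
r_high_var = np.array([0.0, 2.0, 1.0, 3.0, 4.0])

g_low = np.linalg.norm(expected_reward_grad(theta, r_low_var))
g_high = np.linalg.norm(expected_reward_grad(theta, r_high_var))
print(g_low, g_high)  # the high-variance RM yields a far larger gradient
```

Under a uniform policy the gradient norm here is proportional to the root-mean-square deviation of the rewards from their mean, so shrinking the reward spread by three orders of magnitude shrinks the gradient by the same factor even though the ranking is perfect.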
📝 Abstract
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
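The abstract's point that a reward model effective for one language model can induce low variance for another can also be sketched numerically (a hypothetical example of mine, with two fixed distributions standing in for two LMs): the reward variance that matters is taken under the policy's output distribution, so the same reward vector can have high variance under one policy and low variance under another.

```python
import numpy as np

# One fixed reward model scoring four candidate responses.
r = np.array([0.0, 1.0, 2.0, 3.0])

# Two policies standing in for two different language models:
# LM A spreads probability mass broadly over the responses,
# LM B concentrates on two responses with similar rewards.
pi_a = np.array([0.25, 0.25, 0.25, 0.25])
pi_b = np.array([0.0, 0.0, 0.5, 0.5])

def reward_variance(pi, r):
    # Variance of the reward under the policy's output distribution.
    mu = pi @ r
    return pi @ (r - mu) ** 2

print(reward_variance(pi_a, r))  # 1.25
print(reward_variance(pi_b, r))  # 0.25
```

The same RM induces five times less reward variance for LM B, so by the paper's argument the RLHF objective it defines for LM B is correspondingly flatter, even though the RM itself is unchanged.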