Reward Model Overoptimisation in Iterated RLHF

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning from human feedback (RLHF), reward model overoptimisation degrades policy generalisation; although iterated RLHF is widely adopted, its overoptimisation dynamics remain poorly understood. Method: Using the controlled AlpacaFarm benchmark, the authors run multiple rounds of reward model retraining and policy re-optimisation, ablating how reward model training data is transferred across iterations, which reward function is optimised, and how policies are initialised. Contribution/Results: Overoptimisation tends to decrease over successive iterations as reward models increasingly approximate ground-truth preferences, though performance gains diminish over time. Reinitialising each round from the base policy is robust but limits optimisation flexibility, whereas other initialisation strategies often fail to recover from early overoptimisation. These findings offer practical guidance for diagnosing overoptimisation and designing stable, generalisable iterated alignment pipelines.

📝 Abstract
Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.
Problem

Research questions and friction points this paper is trying to address.

Studying reward model overoptimisation in iterated RLHF
Analysing the impact of design choices on overoptimisation dynamics
Evaluating performance trade-offs in policy initialisation strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterated RLHF mitigates reward overoptimisation
Reward model training data transfer across iterations
Policy initialisation strategies impact optimisation flexibility
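The iteration scheme the paper studies can be sketched as a simple loop: each round collects fresh preference data, retrains the reward model on all data accumulated so far, and re-optimises a policy against it, starting either from the base policy or from the previous round's policy. This is a minimal illustrative sketch, not the paper's code; all function names are hypothetical placeholders.

```python
def iterated_rlhf(base_policy, collect_preferences, train_reward_model,
                  optimise_policy, rounds=3, init="base"):
    """Run `rounds` iterations of RLHF (illustrative placeholders throughout).

    init="base"     -> reinitialise from the base policy each round
                       (robust, per the paper, but less flexible);
    init="previous" -> continue from the previous round's policy
                       (may fail to recover from early overoptimisation).
    """
    data = []                 # preference data accumulated across rounds
    policy = base_policy
    for _ in range(rounds):
        data += collect_preferences(policy)      # fresh human feedback
        reward_model = train_reward_model(data)  # retrain on all data so far
        start = base_policy if init == "base" else policy
        policy = optimise_policy(start, reward_model)
    return policy
```

The `init` switch captures the design choice the paper ablates: with `"base"`, early overoptimisation cannot propagate into later rounds, at the cost of discarding optimisation progress each iteration.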