🤖 AI Summary
This work identifies an over-optimization problem in Bradley–Terry reward modeling for RLHF: excessive dispersion of hidden state norms degrades out-of-distribution robustness. To address this, the authors propose Batch-wise Sum-to-Zero Regularization (BSR), which enforces a zero-mean reward sum within each batch while constraining hidden state magnitudes, thereby achieving reward centering and scale control. The work is the first to explicitly identify hidden norm dispersion as the primary cause of over-optimization, and BSR improves out-of-distribution generalization at the reward-modeling level. With an 8B-parameter model, BSR achieves more than 5% higher accuracy on complex preference prediction than prior state-of-the-art reward models; on AlpacaEval 2.0, it reduces average generation length by 40% and increases win rate by 7%. All code, data, and models are publicly released.
📝 Abstract
The Bradley–Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, highlighting the importance of RMs' distributional robustness on unseen data. First, we show that excessive dispersion of hidden state norms is the main source of over-optimization. We then propose batch-wise sum-to-zero regularization (BSR), which enforces a zero-centered reward sum per batch, constraining rewards with extreme magnitudes. We assess the impact of BSR on RM robustness across four over-optimization scenarios, where BSR consistently shows better robustness. We then compare the plain BT model and BSR in RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR with high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. Conducting RLOO training with the 8B RM reduces generation length on AlpacaEval 2.0 by 40% while increasing win rate by 7%, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
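The batch-wise sum-to-zero idea described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's reference implementation: the function name, the use of a squared batch-mean penalty, and the coefficient `beta` are assumptions; only the combination "BT pairwise loss + zero-centering penalty on the batch's rewards" comes from the source.

```python
import numpy as np

def bt_loss_with_bsr(r_chosen, r_rejected, beta=0.01):
    """Bradley-Terry pairwise loss plus an illustrative batch-wise
    sum-to-zero regularizer (BSR).

    r_chosen / r_rejected: arrays of scalar rewards for the preferred
    and dispreferred responses in one batch. `beta` is a hypothetical
    regularization coefficient, not a value from the paper.
    """
    # Standard BT negative log-likelihood: -log sigmoid(r_c - r_r),
    # written as log(1 + exp(-(r_c - r_r))) for numerical stability.
    margin = r_chosen - r_rejected
    bt = np.mean(np.log1p(np.exp(-margin)))

    # BSR term: penalize the squared mean of all rewards in the batch,
    # pushing the batch-wise reward sum toward zero and discouraging
    # rewards with extreme magnitudes.
    all_rewards = np.concatenate([r_chosen, r_rejected])
    bsr = np.square(np.mean(all_rewards))

    return bt + beta * bsr
```

Note that the plain BT loss is invariant to shifting every reward by a constant, which is exactly the degree of freedom the centering term removes: shifting all rewards leaves the BT part unchanged but increases the BSR penalty.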