Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In RLHF, excessive reward optimization, termed "reward hacking," severely undermines alignment stability, primarily due to reward misgeneralization (i.e., fitting spurious features) and insufficient regularization in policy optimization. To address this, the authors propose InfoRM: the first framework to incorporate the Information Bottleneck principle into reward modeling, revealing that reward hacking manifests as anomalous distributional shifts in the latent space. Building on this insight, they design IBL, a distribution-level regularization method, and provide a theoretically equivalent interpretation under pessimistic reinforcement learning. They further introduce MOP, a statistical metric that quantifies hacking severity, enabling online detection, hyperparameter tuning, and early stopping. Extensive evaluation across multiple LLMs and benchmark datasets demonstrates that InfoRM significantly mitigates reward over-optimization, enhancing both policy stability and generalization performance.

📝 Abstract
Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking (or reward over-optimization) remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
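The abstract does not spell out the IB-based training objective. A minimal sketch, assuming a variational IB head (a Gaussian latent with a KL penalty toward a standard normal prior) added on top of a standard Bradley-Terry preference loss, might look like this; the function names and the `beta` weight are illustrative, not from the paper:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def ib_reward_loss(r_chosen, r_rejected, mu, logvar, beta=0.1):
    """Bradley-Terry preference loss plus a beta-weighted IB (KL) penalty.

    r_chosen / r_rejected: scalar rewards for each preference pair.
    mu / logvar: parameters of the latent posterior q(z|x) per pair.
    beta is a hypothetical trade-off weight controlling compression.
    """
    margin = r_chosen - r_rejected
    pref = np.log1p(np.exp(-margin))  # -log sigmoid(margin), numerically stable
    return float(np.mean(pref + beta * kl_to_standard_normal(mu, logvar)))
```

The KL term is what discards preference-irrelevant information: it pushes the latent toward an uninformative prior, so only features that help the preference loss survive in z.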
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking and over-optimization in RLHF alignment
Mitigates reward misgeneralization by filtering out preference-irrelevant information
Introduces regularization to prevent policy space over-restriction during optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Bottleneck principle filters preference-irrelevant information out of reward modeling
Distribution-level regularization (IBL) penalizes latent-space deviations associated with reward hacking
Mahalanobis distance metric (MOP) quantifies reward hacking severity
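The summary names Mahalanobis distance but not the exact form of MOP. A plausible sketch, assuming the SFT model's IB latents are summarized by a single Gaussian and the squared Mahalanobis distance is mapped to a probability via the chi-squared CDF, is shown below; all function names here are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def fit_sft_reference(latents):
    """Fit mean and inverse covariance of SFT-model IB latents (reference set)."""
    mu = latents.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(latents, rowvar=False))
    return mu, cov_inv

def mahalanobis_sq(z, mu, cov_inv):
    """Squared Mahalanobis distance of latent(s) z from the reference Gaussian."""
    d = z - mu
    return np.einsum("...i,ij,...j->...", d, cov_inv, d)

def outlier_probability(z, mu, cov_inv):
    """Map distance to an outlier score in [0, 1] via the chi-squared CDF.

    Under a Gaussian reference, d^2 follows a chi-squared distribution with
    dim degrees of freedom, so scores near 1 flag latents far outside the
    SFT-induced distribution -- the signature of reward-hacked responses.
    """
    dim = mu.shape[-1]
    return chi2.cdf(mahalanobis_sq(z, mu, cov_inv), df=dim)
```

Such a score is cheap to compute during RL training, which is what makes it usable for online monitoring, hyperparameter tuning, and early stopping as the paper describes.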
Yuchun Miao
School of Computer Science, Wuhan University
Image Processing, Remote Sensing, Large Language Model, RLHF, Machine Learning
Liang Ding
School of Computer Science, Faculty of Engineering, The University of Sydney, Australia
Sen Zhang
TikTok (ByteDance), Sydney, Australia
Rong Bao
PhD student, Fudan University
Alignment, Generative AI, Reinforcement Learning
Lefei Zhang
School of Computer Science, Wuhan University
Pattern Recognition, Machine Learning, Image Processing, Remote Sensing
D. Tao
College of Computing & Data Science at Nanyang Technological University, #32 Block N4 #02a-014, 50 Nanyang Avenue, Singapore 639798