🤖 AI Summary
This work addresses two key challenges in reinforcement learning from human feedback (RLHF): reward hacking and unstable policy optimization, together with the implicit trade-off created when existing approaches regularize toward both the reference and the current policy simultaneously. To resolve these issues, the authors propose a unified regularization objective that explicitly balances robustness to reward hacking against policy-update stability. Notably, this is the first method to integrate reference-model regularization and policy stability within a single coherent framework, achieving improved alignment through a weighted supervised fine-tuning loss. Experimental results demonstrate that the proposed approach significantly outperforms conventional RLHF and online preference-learning methods across multiple benchmarks, exhibiting superior alignment performance and enhanced training stability.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has significantly advanced alignment capabilities but remains hindered by two core challenges: \textbf{reward hacking} and \textbf{unstable optimization}. Current solutions address these issues independently through separate regularization strategies: a KL-divergence penalty against a supervised fine-tuned model ($\pi_0$) to mitigate reward hacking, and policy-ratio clipping towards the current policy ($\pi_t$) to promote stable updates. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_0$ and $\pi_t$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss that achieves a superior trade-off, demonstrably improving alignment results while reducing implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference-learning methods, achieving enhanced alignment performance and stability.
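To make the two regularizers in the abstract concrete, here is a minimal per-token sketch of the *standard* PPO-style RLHF objective they refer to: a clipped policy-ratio term anchored to the current policy $\pi_t$ plus a KL-style penalty toward the reference model $\pi_0$. This illustrates the baseline setup, not the paper's unified weighted-SFT objective; all function and argument names are illustrative assumptions.

```python
import math

def rlhf_regularized_loss(logp_new, logp_old, logp_ref, advantage,
                          beta=0.1, clip_eps=0.2):
    """Per-token loss combining the two regularizers described in the
    abstract (standard PPO-RLHF form, not the paper's method):
      - ratio clipping toward the current policy pi_t  (stability)
      - a KL-style penalty toward the reference pi_0   (anti reward hacking)
    Arguments are log-probabilities of the sampled token under the
    updated policy, the current policy pi_t, and the reference pi_0.
    """
    # Importance ratio pi_theta / pi_t, clipped to [1-eps, 1+eps].
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO pessimistic surrogate (negated, since we minimize).
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Simple per-token KL estimate toward the reference model.
    kl_penalty = beta * (logp_new - logp_ref)
    return policy_loss + kl_penalty
```

With `beta=0` and no policy change this reduces to the plain surrogate `-advantage`; raising `beta` pulls updates back toward $\pi_0$, which is exactly the tension between the two anchors that the paper's unified objective is designed to balance.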