🤖 AI Summary
This work addresses reward hacking in reward models trained from human feedback for reinforcement learning, a failure mode that can cause policy performance to plateau or degrade. The authors propose a lightweight robust optimization method that analyzes adversarial perturbations in the reward model's parameter space and introduces the concept of an "advantage sign certification radius." Relying solely on a single reward model, this approach certifies the invariance of advantage signs during policy gradient updates and dynamically down-weights non-robust generations. Evaluated on the TL;DR summarization task and the AlpacaFarm benchmark, the method outperforms existing baselines, achieving higher win rates and mitigating reward hacking.
📝 Abstract
Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius: the norm of the smallest parameter perturbation that can flip a completion's advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
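To make the abstract's two ingredients concrete, here is a minimal sketch of how a first-order sign-certification radius and the resulting down-weighting might look. This is an illustrative reconstruction, not the paper's implementation: the function names, the linearized radius `|advantage| / ||grad||`, and the clipping-based weighting scheme are all assumptions introduced here for exposition.

```python
import numpy as np

def sign_certification_radius(advantage, reward_grad):
    """Hypothetical first-order certificate: under a linear approximation,
    a parameter perturbation delta changes the reward by grad . delta, so
    the smallest L2 perturbation that can flip sign(advantage) has norm
    |advantage| / ||grad||. (Illustrative sketch, not the paper's method.)"""
    grad_norm = np.linalg.norm(reward_grad)
    if grad_norm == 0.0:
        return np.inf  # reward locally insensitive to parameters: sign cannot flip
    return abs(advantage) / grad_norm

def robustness_weight(radius, eps=0.1):
    """Hypothetical weighting: completions whose certified radius falls below
    a robustness budget eps are down-weighted in the policy gradient."""
    return float(min(1.0, radius / eps))

# Toy usage: a completion with a small advantage and a sensitive reward
# gradient gets a small radius and hence a reduced update weight.
adv = 0.05
grad = np.array([0.4, -0.3, 0.2])
r = sign_certification_radius(adv, grad)
w = robustness_weight(r)
```

Under this sketch, robust completions (large radius) keep full weight `1.0`, while completions whose advantage sign could be flipped by a tiny RM perturbation contribute less to the update, matching the abstract's description of down-weighting non-robust completions.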