Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

📅 2026-04-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of reward models trained with human feedback in reinforcement learning to reward hacking, which can cause policy performance to plateau or degrade. The authors propose a lightweight robust optimization method that analyzes adversarial perturbations in the reward model's parameter space and introduces the concept of an "advantage sign certification radius." Relying solely on a single reward model, this approach guarantees the invariance of advantage signs during policy gradient updates and dynamically downweights non-robust generations. Evaluated on the TL;DR summarization task and the AlpacaFarm benchmark, the method significantly outperforms existing baselines, achieving higher win rates and effectively mitigating reward hacking.
๐Ÿ“ Abstract
Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius: the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves better win rates than baselines and reduces reward hacking.
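The abstract does not give the paper's exact formula, but a natural first-order reading of the idea is: a completion's advantage sign is robust when a large RM-parameter perturbation would be needed to flip it, i.e. when the advantage magnitude is large relative to the reward's sensitivity to the RM parameters. A minimal sketch of that down-weighting, assuming a radius of the form |A| / ||∇θ r|| (all function names, the radius estimate, and the clipping rule below are illustrative assumptions, not the paper's method):

```python
import numpy as np

def certification_radius(advantage, reward_grad_norm, eps=1e-8):
    # Assumed first-order estimate: the smallest RM-parameter
    # perturbation that could flip this completion's advantage sign
    # scales with |A| and inversely with the reward's sensitivity.
    return abs(advantage) / (reward_grad_norm + eps)

def signcert_weights(advantages, reward_grad_norms, tau=1.0):
    # Map each completion's radius to a weight in [0, 1]; completions
    # whose radius falls below the threshold tau are down-weighted
    # before the policy gradient update.
    radii = np.array([certification_radius(a, g)
                      for a, g in zip(advantages, reward_grad_norms)])
    return np.clip(radii / tau, 0.0, 1.0)

# Toy batch: one robust completion (large advantage, low sensitivity)
# and one non-robust completion (small advantage, high sensitivity).
advantages = [2.0, 0.1]
grad_norms = [0.5, 4.0]
weights = signcert_weights(advantages, grad_norms, tau=1.0)
```

Here the robust completion keeps full weight while the non-robust one is strongly down-weighted, which matches the abstract's description of dynamically down-weighting non-robust generations using only the RM and on-policy completions.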
Problem

Research questions and friction points this paper is trying to address.

reward hacking
reinforcement learning from human feedback
reward models
advantage sign
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking
advantage sign robustness
RLHF
policy optimization
sign-certified
🔎 Similar Papers
No similar papers found.