Adversarial Training of Reward Models

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary reward models (RMs) exhibit poor out-of-distribution (OOD) robustness, frequently assigning high rewards to low-quality responses—leading to reward hacking and undermining alignment stability. To address this, we propose Adv-RM, the first framework to systematically integrate adversarial training into reward modeling. Adv-RM employs reinforcement learning to automatically discover vulnerabilities in the Nemotron-340B RM, generating high-reward–low-quality adversarial examples that are then incorporated into RM retraining. Crucially, it requires no hand-crafted shortcut signals and enables end-to-end identification and mitigation of RM decision flaws. Experiments demonstrate that Adv-RM significantly improves RM OOD robustness and discriminative consistency on both synthetic and real-world datasets, enhances RLHF training stability, and effectively suppresses reward hacking. This work establishes a novel, scalable paradigm for building robust, aligned language models.

📝 Abstract
Reward modeling has emerged as a promising approach for the scalable alignment of language models. However, contemporary reward models (RMs) often lack robustness, awarding high rewards to low-quality, out-of-distribution (OOD) samples. This can lead to reward hacking, where policies exploit unintended shortcuts to maximize rewards, undermining alignment. To address this challenge, we introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples -- responses that receive high rewards from the target RM but are OOD and of low quality. By leveraging reinforcement learning, Adv-RM trains a policy to generate adversarial examples that reliably expose vulnerabilities in large state-of-the-art reward models such as Nemotron 340B RM. Incorporating these adversarial examples into the reward training process improves the robustness of RMs, mitigating reward hacking and enhancing downstream performance in RLHF. We demonstrate that Adv-RM significantly outperforms conventional RM training, increasing stability and enabling more effective RLHF training in both synthetic and real-data settings.
Problem

Research questions and friction points this paper is trying to address.

Improving robustness of reward models against adversarial examples
Preventing reward hacking in language model alignment
Enhancing RLHF performance through adversarial training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training framework for reward models
Generates adversarial examples via reinforcement learning
Improves robustness and mitigates reward hacking
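The loop behind these contributions, attack the RM to find high-reward but low-quality responses, then retrain on them as negatives, can be sketched in miniature. This is a toy illustration under assumed simplifications (a linear reward model over feature vectors, random search standing in for the RL attacker, and an explicit quality oracle), not the paper's actual implementation; all function names here are hypothetical.

```python
import random

random.seed(0)

def reward(w, x):
    # Toy reward model: a linear score over a 2-d response feature vector.
    return sum(wi * xi for wi, xi in zip(w, x))

def true_quality(x):
    # Hidden ground-truth quality the RM should track (unknown to the RM).
    return x[0] - 0.5 * x[1]

def find_adversarial(w, n_trials=200):
    # Stand-in for the RL attacker: search for responses the RM scores
    # highly (reward > 0.5) but whose true quality is low (quality < 0).
    best = None
    for _ in range(n_trials):
        x = [random.uniform(-1, 1) for _ in range(2)]
        if reward(w, x) > 0.5 and true_quality(x) < 0.0:
            if best is None or reward(w, x) > reward(w, best):
                best = x
    return best

def retrain(w, adv, lr=0.5, steps=50):
    # Treat the adversarial example as a labeled negative: descend on the
    # RM's score for it until the reward drops below zero.
    for _ in range(steps):
        if reward(w, adv) <= 0.0:
            break
        w = [wi - lr * xi for wi, xi in zip(w, adv)]
    return w

w = [0.2, 1.0]            # initial RM over-rewards x[1], which hurts quality
adv = find_adversarial(w)  # exposes a high-reward, low-quality response
w2 = retrain(w, adv)       # patched RM no longer rewards it
```

In the paper this search is driven by reinforcement learning against a large RM (e.g. Nemotron 340B) rather than random sampling, and the discovered examples are folded back into full RM training rather than a single gradient patch, but the attack-then-retrain structure is the same.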