RRM: Robust Reward Model Training Mitigates Reward Hacking

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Existing reward models (RMs) are susceptible to prompt-irrelevant artifacts, such as response length and formatting, which enable "reward hacking" and hinder accurate modeling of genuine human preferences. To address this, the authors introduce a causal framework that learns preferences independently of these artifacts, along with a data augmentation technique that removes them from training pairs. The resulting robust reward model (RRM), trained on Gemma-2-9b-it, improves RewardBench accuracy by 3.54 points (80.61% → 84.15%). DPO policies trained with the RRM instead of the baseline RM improve MT-Bench from 7.27 to 8.31 and the AlpacaEval-2 length-controlled win rate by 19.03 percentage points (33.46% → 52.49%), indicating stronger generalization and more reliable alignment.

📝 Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
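The abstract's data-augmentation idea can be sketched in code. The sketch below is an illustrative assumption about one plausible scheme, not the paper's exact recipe: a response drawn from an *unrelated* prompt carries only prompt-independent artifacts (length, formatting), so pairing it against an on-prompt winner forces the RM to rely on contextual signal rather than surface features.

```python
import random

def augment_preferences(dataset, seed=0):
    """Illustrative artifact-removal augmentation (an assumption,
    not the paper's exact recipe).

    dataset: list of dicts with keys 'prompt', 'chosen', 'rejected'.
    For each example, add a pair whose 'rejected' side is a response
    sampled from a different prompt: it may share artifacts (length,
    format) with good answers but lacks contextual relevance, so an RM
    that prefers it is rewarding artifacts, not prompt-grounded quality.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for ex in dataset:
        other = rng.choice(dataset)
        if other["prompt"] == ex["prompt"]:
            continue  # skip: we need a genuinely off-prompt response
        augmented.append({
            "prompt": ex["prompt"],
            "chosen": ex["chosen"],       # on-prompt winner
            "rejected": other["chosen"],  # off-prompt distractor
        })
    return augmented

data = [
    {"prompt": "Translate 'cat' to French.", "chosen": "chat", "rejected": "dog"},
    {"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"},
]
print(len(augment_preferences(data)))  # original pairs plus any off-prompt pairs
```

Training on the union of original and augmented pairs would then penalize the RM for preferring artifact-heavy but contextually irrelevant responses.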
Problem

Research questions and friction points this paper is trying to address.

Traditional RM training on prompt-tied response pairs cannot disentangle prompt-driven preferences from prompt-independent artifacts such as response length and format.
RMs therefore conflate contextual quality signals with irrelevant artifacts when scoring responses, opening the door to reward hacking.
Artifact-sensitive rewards propagate into downstream alignment, limiting the reliability of DPO-trained policies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A causal framework that learns preferences independently of prompt-irrelevant artifacts
A data augmentation technique that filters length and format artifacts out of training pairs
A robust reward model (RRM) that lifts RewardBench accuracy from 80.61% to 84.15% and substantially improves DPO-aligned policies
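For context, the pairwise reward model mentioned above is conventionally trained with a Bradley–Terry objective. A minimal numeric sketch (the scalar scores stand in for the model's outputs; the loss form is the standard one, and the paper's exact setup may differ):

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """Standard Bradley-Terry pairwise loss:
    -log sigmoid(r_chosen - r_rejected).
    Minimizing it drives the RM to score the preferred response higher."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs low loss; an inverted pair, high loss.
print(round(pairwise_rm_loss(2.0, 0.0), 4))  # ≈ 0.1269
print(round(pairwise_rm_loss(0.0, 2.0), 4))  # ≈ 2.1269
```

The paper's contribution is orthogonal to this loss: it changes the *data* the loss sees, so the margin reflects prompt-grounded preference rather than length or formatting.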