Rectifying Shortcut Behaviors in Preference-based Reward Learning

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In preference-based reward learning, reward models are vulnerable to “reward hacking”—exploiting spurious shortcuts (e.g., response length, overly polite phrasing) rather than aligning with true human intent, leading to poor out-of-distribution generalization. This work is the first to formally unify such failures under the “shortcut behavior” problem. We propose PRISM, a principled framework grounded in kernel invariant learning: it constructs group-invariant kernel functions and feature mappings for preference data and solves for invariant reward modeling via closed-form optimization, systematically suppressing diverse shortcut dependencies. Experiments demonstrate that PRISM significantly improves reward model accuracy on out-of-distribution preferences and reduces downstream policy models’ reliance on spurious features—validating its robustness and generalization capability across diverse benchmarks.
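To make the shortcut failure mode concrete, here is a minimal, hypothetical Bradley-Terry-style reward-learning sketch (not the paper's setup): a linear reward is fit on preference pairs where a spurious "length" feature co-varies with the labels, so the learned weights load on the shortcut dimension as well as the causal one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy Bradley-Terry reward learning: r(x) = w . phi(x).
# Feature 0 is an assumed "content quality" signal; feature 1 is a
# spurious "response length" feature that happens to correlate with
# the preference labels in this synthetic training data.
rng = np.random.default_rng(0)
n = 200
phi_chosen = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(n, 2))
phi_rejected = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n, 2))

w = np.zeros(2)
lr = 0.5
for _ in range(200):
    # P(chosen preferred) = sigmoid(r(chosen) - r(rejected))
    diff = (phi_chosen - phi_rejected) @ w
    p = sigmoid(diff)
    # Gradient of the negative log-likelihood w.r.t. w
    grad = -((1 - p)[:, None] * (phi_chosen - phi_rejected)).mean(axis=0)
    w -= lr * grad

# Because length co-varies with preference in training, the model also
# assigns positive weight to the spurious dimension: the shortcut.
print(w)
```

A reward model trained this way scores longer responses higher even when length is incidental, which is exactly the out-of-distribution failure the summary describes.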

📝 Abstract
In reinforcement learning from human feedback, preference-based reward models play a central role in steering large language models toward human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts: spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigating them in preference-based reward learning. Inspired by invariant theory from the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps under a closed-form learning objective. Experimental results on several benchmarks show that our method consistently improves reward-model accuracy on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
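The abstract's core idea, learning a group-invariant kernel and solving for the reward in closed form, can be illustrated with a generic sketch. The group `G`, the RBF base kernel, and the shortcut coordinate below are all illustrative assumptions, not PRISM's actual construction: the kernel is made invariant by averaging over a hypothetical group that flips the sign of the assumed shortcut feature, and the reward is then obtained via a standard kernel ridge closed-form solve.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Base (non-invariant) RBF kernel matrix between row sets X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def group_orbit(X):
    # Hypothetical finite group G = {identity, flip}: each element acts
    # by negating the assumed "shortcut" coordinate (index 1).
    Xf = X.copy()
    Xf[:, 1] = -Xf[:, 1]
    return [X, Xf]

def invariant_kernel(X, Y, gamma=1.0):
    # Double group averaging k_G(x, y) = mean_{g,h in G} k(g x, h y)
    # yields a kernel invariant to the group action on either argument.
    mats = [rbf(gX, hY, gamma) for gX in group_orbit(X) for hY in group_orbit(Y)]
    return sum(mats) / len(mats)

# Closed-form fit of a reward on synthetic targets that depend only on
# the causal feature (index 0), via kernel ridge regression.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.tanh(X[:, 0])
K = invariant_kernel(X, X)
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(X)), y)

def reward(x):
    return invariant_kernel(np.atleast_2d(x), X) @ alpha

# The learned reward is unchanged when the shortcut coordinate flips:
x = np.array([0.3, 2.0])
x_flipped = np.array([0.3, -2.0])
print(reward(x), reward(x_flipped))
```

By construction, any two inputs in the same group orbit receive identical rewards, so the model cannot score responses based on the shortcut coordinate, which mirrors the invariance property the abstract attributes to PRISM.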
Problem

Research questions and friction points this paper is trying to address.

Mitigating shortcut behaviors in preference-based reward learning
Addressing reward hacking and poor generalization in RLHF
Reducing dependency on spurious features in human feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns group-invariant kernels for reward models
Mitigates shortcut behaviors via closed-form objective
Improves generalization on out-of-distribution tasks