Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking

📅 2025-10-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In reinforcement learning, hand-crafted reward functions often misalign with true human objectives, leading to reward hacking. To address this, we propose a preference-based reward repair framework: given a small set of pairwise preference labels over state transitions, it iteratively learns a transition-specific additive correction term that fuses the prior reward signal with human feedback, while a directed exploration strategy prioritizes the transitions most in need of correction. Theoretically, our method achieves a regret bound matching the state of the art for comparable preference-based approaches. Empirically, on multiple reward-hacking benchmarks and using only a handful of preference annotations, our method significantly outperforms both learning a reward function from scratch and conventional reward redesign techniques, efficiently recovering near-optimal policy performance.

📝 Abstract
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the human's true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove that, in tabular domains, PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or that modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high-performing policies.
Problem

Research questions and friction points this paper is trying to address.

Repairing misaligned reward functions to prevent reward hacking in RL agents
Learning additive correction terms from human preferences over trajectories
Reducing dataset costs while maintaining optimal policy performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repairs reward functions using human feedback
Learns additive correction term from preferences
Uses targeted exploration for efficient preference learning
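The core idea of the additive correction can be illustrated in a few lines. The sketch below is not the authors' implementation: the tabular setup, proxy reward values, segment choice, and learning rate are illustrative assumptions. It learns a per-transition correction term on top of a fixed proxy reward from a single pairwise preference, using a standard Bradley-Terry preference loss (the repaired reward is `r_proxy + delta`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tabular setup: 4 transitions with a misspecified proxy reward
# that mistakenly favors transitions 0 and 1 (a "reward hack").
r_proxy = np.array([1.0, 1.0, 0.0, 0.0])
delta = np.zeros(4)  # learnable additive, transition-dependent correction

# One human preference: segment B (transitions [2, 3]) is preferred to A ([0, 1]).
seg_a, seg_b = [0, 1], [2, 3]

lr = 0.5
for _ in range(200):
    ret_a = (r_proxy + delta)[seg_a].sum()  # repaired return of segment A
    ret_b = (r_proxy + delta)[seg_b].sum()  # repaired return of segment B
    # Bradley-Terry model: P(B > A) = sigmoid(ret_b - ret_a); minimize -log P
    p = sigmoid(ret_b - ret_a)
    grad = 1.0 - p  # gradient of -log p w.r.t. ret_b (and its negation for ret_a)
    delta[seg_b] += lr * grad
    delta[seg_a] -= lr * grad

repaired = r_proxy + delta
# After training, the repaired reward ranks the preferred segment higher,
# while the proxy reward itself is left untouched.
```

Because only `delta` is updated, corrections stay local to the transitions the preferences actually touch, which is why a handful of labels can suffice when the proxy is wrong on only a few transitions.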