Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking

📅 2025-10-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In reinforcement learning, hand-crafted reward functions often misalign with true human objectives, leading to reward hacking. To address this, we propose a preference-based reward repair framework: given a small set of pairwise preference labels over state transitions, it iteratively learns a transition-specific additive correction term that fuses the prior reward signal with human feedback, while a directed exploration strategy prioritizes the transitions most in need of correction. Theoretically, our method achieves a regret bound matching the state of the art for comparable preference-based approaches. Empirically, on multiple reward-hacking benchmarks and using only a handful of preference annotations, our method significantly outperforms both learning a reward function from scratch and conventional reward redesign techniques, efficiently recovering near-optimal policy performance.

📝 Abstract
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the human's true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove that, in tabular domains, PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or that modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high-performing policies.
Problem

Research questions and friction points this paper is trying to address.

Repairing misaligned reward functions to prevent reward hacking in RL agents
Learning additive correction terms from human preferences over trajectories
Reducing dataset costs while maintaining optimal policy performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repairs reward functions using human feedback
Learns additive correction term from preferences
Uses targeted exploration for efficient preference learning
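The core idea of the additive correction can be illustrated in a few lines. The sketch below is not the authors' implementation: the tabular setup, proxy reward values, segment choice, and learning rate are illustrative assumptions. It learns a per-transition correction term on top of a fixed proxy reward from a single pairwise preference, using a standard Bradley-Terry preference loss (the repaired reward is `r_proxy + delta`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tabular setup: 4 transitions with a misspecified proxy reward
# that mistakenly favors transitions 0 and 1 (a "reward hack").
r_proxy = np.array([1.0, 1.0, 0.0, 0.0])
delta = np.zeros(4)  # learnable additive, transition-dependent correction

# One human preference: segment B (transitions [2, 3]) is preferred to A ([0, 1]).
seg_a, seg_b = [0, 1], [2, 3]

lr = 0.5
for _ in range(200):
    ret_a = (r_proxy + delta)[seg_a].sum()  # repaired return of segment A
    ret_b = (r_proxy + delta)[seg_b].sum()  # repaired return of segment B
    # Bradley-Terry model: P(B > A) = sigmoid(ret_b - ret_a); minimize -log P
    p = sigmoid(ret_b - ret_a)
    grad = 1.0 - p  # gradient of -log p w.r.t. ret_b (and its negation for ret_a)
    delta[seg_b] += lr * grad
    delta[seg_a] -= lr * grad

repaired = r_proxy + delta
# After training, the repaired reward ranks the preferred segment higher,
# while the proxy reward itself is left untouched.
```

Because only `delta` is updated, corrections stay local to the transitions the preferences actually touch, which is why a handful of labels can suffice when the proxy is wrong on only a few transitions.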