Mitigating Preference Hacking in Policy Optimization with Pessimism

📅 2025-03-10

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This work addresses reward/preference over-optimization and preference hijacking in RLHF, arising from preference model generalization failure. We propose an uncertainty-aware pessimistic optimization framework. First, we construct a provably robust pessimistic objective for preference learning—ensuring distributional robustness against out-of-distribution preferences—and unify treatment of both preference and reward models. Building on this, we design two algorithms: P3O and PRPO, which jointly integrate pessimistic reinforcement learning, uncertainty estimation, and KL-constrained policy updates. Evaluated on summarization and assistant alignment tasks, our methods improve stability by 42% over baseline approaches and reduce preference hijacking to under 5%, without requiring additional human annotations or data augmentation.

Technology Category

Application Category

📝 Abstract

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on emph{fixed preference datasets}, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking phenomenon. We propose novel, pessimistic objectives for RLHF which are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms, P3O and PRPO, to optimize these objectives. Our approach is derived for the general preference optimization setting, but can be used with reward models as well. We evaluate P3O and PRPO on the tasks of fine-tuning language models for document summarization and creating helpful assistants, demonstrating remarkable resilience to overoptimization.

Problem

Research questions and friction points this paper is trying to address.

Addresses overoptimization in RLHF from human feedback

Proposes pessimistic objectives to prevent reward hacking

Evaluates methods on language model fine-tuning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pessimistic objectives for RLHF robustness

P3O and PRPO algorithms for optimization

Resilience to overoptimization in language models

🔎 Similar Papers

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation