Learning a Pessimistic Reward Model in RLHF

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline RLHF, learned reward models are prone to reward hacking, and the KL regularization commonly used to mitigate it restricts how far the policy can deviate from the behavior data. To address both issues, this paper proposes PET, a pessimistic reward fine-tuning framework for learning reward models that remain robust without any regularization. PET imposes a pessimistic bias on reward estimation, so that greedily optimizing the learned reward no longer leads to reward hacking. Its key contributions are: (i) a regularization-free defense against reward hacking in offline RLHF; and (ii) policies that diverge substantially further from the behavior dataset (up to 2.3× higher KL divergence) while maintaining state-of-the-art summary quality. Empirical evaluation on the TL;DR summarization task shows that PET combines this greater policy flexibility with consistent performance, offering a more reliable reward modeling paradigm for offline RLHF.

📝 Abstract
This work proposes PET, a novel pessimistic reward fine-tuning method for learning a pessimistic reward model that is robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, and KL regularization then plays a pivotal role in mitigating reward hacking during policy optimization. Such an intuition-based remedy still suffers from reward hacking, and policies with large KL divergence from the dataset distribution are excluded from learning. In contrast, we show that when a policy is optimized on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our method on the standard TL;DR summarization dataset and find that a high-quality policy can be learned on our pessimistic reward model without using any regularization. Such a policy has a high KL divergence from the dataset distribution while performing well in practice. In summary, our work shows the feasibility of learning a pessimistic reward model that defends against reward hacking: the agent can greedily search for a policy with high pessimistic reward without suffering from reward hacking.
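For context, the standard offline RLHF recipe optimizes the learned reward under a KL penalty toward a reference policy, whereas the setting described above drops that penalty and optimizes the pessimistic reward directly. The display below states this contrast; the symbols (learned reward r_phi, pessimistic reward r_PET, reference policy pi_ref, coefficient beta) are our own notation for illustration and are not taken from the paper.

```latex
% Standard KL-regularized RLHF objective (the common baseline):
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\phi}(x, y) \right]
  \;-\; \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Regularization-free objective on a pessimistic reward, as described in the abstract:
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\mathrm{PET}}(x, y) \right]
```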
Problem

Research questions and friction points this paper is trying to address.

How to prevent reward hacking in offline RLHF without relying on KL regularization
How to learn a reward model that is robust to exploitation by the optimized policy
How to obtain high-performing policies whose KL divergence from the dataset distribution is large
Innovation

Methods, ideas, or system contributions that make the work stand out.

A pessimistic reward model that prevents reward hacking (illustrative sketch after this list)
No KL regularization needed during policy optimization
Policies with high KL divergence from the dataset distribution still achieve high performance
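The summary above does not spell out how PET constructs its pessimistic reward, so the sketch below uses a common stand-in for pessimism, the lower bound over a small reward-model ensemble, purely to illustrate the idea of greedily optimizing a pessimistic reward with no KL penalty term. The function names, signatures, and the ensemble approach are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: the paper's PET construction is not detailed here, so
# "pessimism" is realized with a stand-in, the minimum over a reward-model ensemble.
import torch

def pessimistic_reward(reward_models, prompt_ids, response_ids):
    """Score (prompt, response) pairs with the minimum over an ensemble of reward models.

    `reward_models` is assumed to be a list of modules mapping token ids to a scalar
    reward per example; taking the ensemble minimum acts as a pessimistic (lower-bound)
    estimate, discouraging the policy from exploiting any single model's errors.
    """
    with torch.no_grad():
        scores = torch.stack(
            [rm(prompt_ids, response_ids) for rm in reward_models], dim=0
        )  # shape: (num_models, batch)
    return scores.min(dim=0).values

def select_best_response(reward_models, prompt_ids, candidate_responses):
    """Greedy best-of-n selection on the pessimistic reward, with no KL penalty term."""
    scores = torch.stack(
        [pessimistic_reward(reward_models, prompt_ids, resp) for resp in candidate_responses],
        dim=0,
    )  # shape: (num_candidates, batch)
    return scores.argmax(dim=0)  # index of the highest-scoring candidate per prompt
```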
Yinglun Xu
University of Illinois Urbana-Champaign
Machine Learning · Reinforcement Learning
Hangoo Kang
University of Illinois Urbana-Champaign
Tarun Suresh
Undergraduate, University of Illinois Urbana-Champaign
Deep Learning · Machine Learning · Reinforcement Learning · Programming Languages · Formal Methods
Yuxuan Wan
University of Illinois Urbana-Champaign
Gagandeep Singh
University of Illinois Urbana-Champaign