Learning a Pessimistic Reward Model in RLHF

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline RLHF, learned reward models are prone to reward hacking, and the KL regularization commonly used to mitigate it restricts how far the policy can deviate from the behavior data. To address both issues, this paper proposes PET, a pessimistic reward fine-tuning framework for learning reward models that remain robust without any regularization. PET imposes a pessimistic bias on reward estimation, so that greedily optimizing the learned reward no longer leads to reward hacking. Its key contributions are: (i) a regularization-free defense against reward hacking in offline RLHF; and (ii) policies that diverge substantially further from the behavior dataset (up to 2.3× higher KL divergence) while maintaining state-of-the-art summary quality. Empirical evaluation on the TL;DR summarization task shows that PET combines this greater policy flexibility with consistent performance, offering a more reliable reward modeling paradigm for offline RLHF.

📝 Abstract
This work proposes PET, a novel pessimistic reward fine-tuning method for learning a pessimistic reward model that is robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, and KL regularization then plays a pivotal role in mitigating reward hacking during policy optimization. Such an intuition-based remedy still suffers from reward hacking, and policies with large KL divergence from the dataset distribution are excluded from learning. In contrast, we show that when a policy is optimized on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our method on the standard TL;DR summarization dataset and find that a high-quality policy can be learned on our pessimistic reward model without using any regularization. Such a policy has a high KL divergence from the dataset distribution while performing well in practice. In summary, our work shows the feasibility of learning a pessimistic reward model that defends against reward hacking: the agent can greedily search for a policy with high pessimistic reward without suffering from reward hacking.
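For context, the standard offline RLHF recipe optimizes the learned reward under a KL penalty toward a reference policy, whereas the setting described above drops that penalty and optimizes the pessimistic reward directly. The display below states this contrast; the symbols (learned reward r_phi, pessimistic reward r_PET, reference policy pi_ref, coefficient beta) are our own notation for illustration and are not taken from the paper.

```latex
% Standard KL-regularized RLHF objective (the common baseline):
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\phi}(x, y) \right]
  \;-\; \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Regularization-free objective on a pessimistic reward, as described in the abstract:
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\mathrm{PET}}(x, y) \right]
```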
Problem

Research questions and friction points this paper is trying to address.

How to prevent reward hacking in offline RLHF without relying on KL regularization
How to learn a reward model that is robust to exploitation by the optimized policy
How to obtain high-performing policies whose KL divergence from the dataset distribution is large
Innovation

Methods, ideas, or system contributions that make the work stand out.

A pessimistic reward model that prevents reward hacking (illustrative sketch after this list)
No KL regularization needed during policy optimization
Policies with high KL divergence from the dataset distribution still achieve high performance
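The summary above does not spell out how PET constructs its pessimistic reward, so the sketch below uses a common stand-in for pessimism, the lower bound over a small reward-model ensemble, purely to illustrate the idea of greedily optimizing a pessimistic reward with no KL penalty term. The function names, signatures, and the ensemble approach are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: the paper's PET construction is not detailed here, so
# "pessimism" is realized with a stand-in, the minimum over a reward-model ensemble.
import torch

def pessimistic_reward(reward_models, prompt_ids, response_ids):
    """Score (prompt, response) pairs with the minimum over an ensemble of reward models.

    `reward_models` is assumed to be a list of modules mapping token ids to a scalar
    reward per example; taking the ensemble minimum acts as a pessimistic (lower-bound)
    estimate, discouraging the policy from exploiting any single model's errors.
    """
    with torch.no_grad():
        scores = torch.stack(
            [rm(prompt_ids, response_ids) for rm in reward_models], dim=0
        )  # shape: (num_models, batch)
    return scores.min(dim=0).values

def select_best_response(reward_models, prompt_ids, candidate_responses):
    """Greedy best-of-n selection on the pessimistic reward, with no KL penalty term."""
    scores = torch.stack(
        [pessimistic_reward(reward_models, prompt_ids, resp) for resp in candidate_responses],
        dim=0,
    )  # shape: (num_candidates, batch)
    return scores.argmax(dim=0)  # index of the highest-scoring candidate per prompt
```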
Yinglun Xu
University of Illinois Urbana-Champaign
Machine Learning · Reinforcement Learning
Hangoo Kang
University of Illinois Urbana-Champaign
Tarun Suresh
Undergraduate, University of Illinois Urbana-Champaign
Deep Learning · Machine Learning · Reinforcement Learning · Programming Languages · Formal Methods
Yuxuan Wan
University of Illinois Urbana-Champaign
Gagandeep Singh
University of Illinois Urbana-Champaign