Policy-labeled Preference Learning: Is Preference Enough for RLHF?

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RLHF methods erroneously assume human-labeled trajectories are generated by an optimal policy, leading to biased trajectory likelihood estimation and suboptimal policy learning. To address this, we propose policy-labeled preference learning (PPL), the first framework to explicitly model regret in preference learning, thereby recovering behavior-policy information and calibrating the trajectory distribution. Our key contributions are: (1) a regret-based preference likelihood model that mitigates likelihood misestimation; and (2) a derived contrastive KL regularization term that enhances policy stability and alignment in sequential decision making. Evaluated on high-dimensional continuous control tasks, PPL significantly improves offline RLHF performance and demonstrates strong generalization and robustness in online RLHF settings.

📝 Abstract
To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by the Direct Preference Optimization (DPO) framework, which directly learns an optimal policy without an explicit reward, we propose policy-labeled preference learning (PPL), which resolves likelihood mismatch by modeling human preferences with regret, reflecting behavior-policy information. We also provide a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision making. Experiments on high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
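The regret-based preference model described in the abstract can be sketched as a Bradley-Terry comparison over trajectory scores. The sketch below uses the max-entropy identity that the optimal advantage is proportional to the action's log-probability under the soft-optimal policy, so a trajectory's score reduces to its summed action log-probabilities. The function names, the `alpha` temperature, and this simplification are assumptions of the sketch, not the paper's exact formulation.

```python
import math

def trajectory_score(action_logprobs, alpha=1.0):
    # Score = negated regret under a max-entropy RL assumption, where the
    # optimal advantage is alpha * log pi*(a|s); summing over the trajectory
    # gives a regret-based trajectory score (assumption of this sketch).
    return alpha * sum(action_logprobs)

def pref_prob(logps_1, logps_2, alpha=1.0):
    # Bradley-Terry preference probability that trajectory 1 is preferred,
    # computed from regret-based scores rather than summed rewards.
    d = trajectory_score(logps_1, alpha) - trajectory_score(logps_2, alpha)
    return 1.0 / (1.0 + math.exp(-d))
```

Under this model, a trajectory whose actions are likelier under the optimal policy (lower regret) is predicted to be preferred, which is how PPL ties the preference likelihood back to behavior-policy information.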
Problem

Research questions and friction points this paper is trying to address.

Existing RLHF methods assume preference-labeled trajectories were generated by an optimal policy, biasing trajectory likelihood estimation
This likelihood mismatch leads to suboptimal policy learning, especially in sequential decision making
Offline and online RLHF performance in high-dimensional continuous control suffers as a result
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy-labeled preference learning (PPL) with regret-based preference modeling
Contrastive KL regularization, derived from regret-based principles, for sequential decision making
Recovery of behavior-policy information to correct likelihood mismatch in existing RLHF methods
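The contrastive KL regularization is only named here, not specified. One plausible reading, sketched below purely as an illustration, is a term that anchors the learned policy to a reference on states from the preferred trajectory while contrasting it against states from the dispreferred one. The function names, the `beta` weight, and the sign structure are all hypothetical; the paper's actual term may differ.

```python
import math

def kl_categorical(p, q):
    # KL(p || q) for discrete distributions given as probability lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def contrastive_kl_penalty(pi, ref, win_states, lose_states, beta=0.1):
    # Hypothetical contrastive regularizer: pull the policy toward the
    # reference on preferred-trajectory states and push it away on
    # dispreferred-trajectory states. Illustrative sketch only.
    pull = sum(kl_categorical(pi[s], ref[s]) for s in win_states)
    push = sum(kl_categorical(pi[s], ref[s]) for s in lose_states)
    return beta * (pull - push)
```

A term of this shape would add to the preference loss a pressure that stabilizes the policy near a reference where preferences support it, which matches the stated goal of enhancing policy stability in sequential decision making.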