🤖 AI Summary
In Reinforcement Learning from Human Feedback (RLHF), the sampling of preference data often deviates from true human values, so reward models only approximately fit, rather than explicitly optimize, the implicit oracle reward. Method: We propose PILAF (Policy-Interpolated Learning for Aligned Feedback), a preference-sampling framework that generates high-quality response pairs by interpolating in policy space and explicitly aligns preference learning with maximizing the underlying oracle reward during iterative or online human feedback. Contribution/Results: We provide the first theoretical analysis establishing the optimality of this preference sampling from both optimization and statistical perspectives. Experiments demonstrate that PILAF significantly improves reward-model accuracy and policy alignment in both iterative and online RLHF settings, while substantially reducing the volume of human feedback required.
📝 Abstract
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, with optimality guarantees from both an optimization and a statistical perspective. The method is straightforward to implement and performs strongly in iterative and online RLHF settings, where feedback curation is critical.
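The abstract describes sampling response pairs for preference labeling by interpolating in policy space. A minimal toy sketch of one such scheme follows; the convex mixture of log-probabilities, the coefficient `beta`, and the stand-in `policy_logits` / `ref_logits` functions are all illustrative assumptions, not the paper's exact sampling rule.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def sample_response(logits_fn, length=5):
    """Autoregressively sample a toy token sequence from a logits function."""
    tokens = []
    for _ in range(length):
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max())  # softmax over the toy vocabulary
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens


# Hypothetical stand-ins for the current policy and a reference policy.
def policy_logits(prefix):
    return np.sin(np.arange(VOCAB) + len(prefix))


def ref_logits(prefix):
    return np.cos(np.arange(VOCAB) * 0.5)


def interpolated_logits(prefix, beta):
    # Interpolation in policy space, sketched here as a convex mixture of
    # log-probabilities (equivalently logits, up to normalization).
    return (1.0 - beta) * policy_logits(prefix) + beta * ref_logits(prefix)


# Generate a candidate response pair for preference labeling by sampling
# from two differently interpolated policies.
pair = (
    sample_response(lambda p: interpolated_logits(p, beta=0.25)),
    sample_response(lambda p: interpolated_logits(p, beta=0.75)),
)
print(pair)
```

Labelers (or an oracle reward model) would then annotate which response in `pair` is preferred, and the resulting preference data would be used to update the reward model or policy.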