🤖 AI Summary
In Reinforcement Learning from Human Feedback (RLHF), the sampling of preference data often deviates from true human values, so reward models only approximately fit, rather than explicitly optimize, the implicit oracle reward. Method: We propose PILAF (Policy-Interpolated Learning for Aligned Feedback), a preference-sampling framework that generates high-quality response pairs by interpolating in policy space and explicitly aligns preference learning with maximizing the underlying oracle reward during iterative or online human feedback. Contribution/Results: We provide the first theoretical analysis establishing the optimality of this preference sampling from both optimization and statistical perspectives. Experiments demonstrate that PILAF significantly improves reward-model accuracy and policy alignment in both iterative and online RLHF settings, while substantially reducing the volume of human feedback required.
📝 Abstract
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, with optimality guarantees from both an optimization and a statistical perspective. The method is straightforward to implement and performs strongly in iterative and online RLHF settings, where feedback curation is critical.
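The abstract describes sampling response pairs for preference labeling by interpolating in policy space. A minimal toy sketch of one such scheme follows; the convex mixture of log-probabilities, the coefficient `beta`, and the stand-in `policy_logits` / `ref_logits` functions are all illustrative assumptions, not the paper's exact sampling rule.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def sample_response(logits_fn, length=5):
    """Autoregressively sample a toy token sequence from a logits function."""
    tokens = []
    for _ in range(length):
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max())  # softmax over the toy vocabulary
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens


# Hypothetical stand-ins for the current policy and a reference policy.
def policy_logits(prefix):
    return np.sin(np.arange(VOCAB) + len(prefix))


def ref_logits(prefix):
    return np.cos(np.arange(VOCAB) * 0.5)


def interpolated_logits(prefix, beta):
    # Interpolation in policy space, sketched here as a convex mixture of
    # log-probabilities (equivalently logits, up to normalization).
    return (1.0 - beta) * policy_logits(prefix) + beta * ref_logits(prefix)


# Generate a candidate response pair for preference labeling by sampling
# from two differently interpolated policies.
pair = (
    sample_response(lambda p: interpolated_logits(p, beta=0.25)),
    sample_response(lambda p: interpolated_logits(p, beta=0.75)),
)
print(pair)
```

Labelers (or an oracle reward model) would then annotate which response in `pair` is preferred, and the resulting preference data would be used to update the reward model or policy.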