🤖 AI Summary
This work addresses the high cost and low efficiency of human preference data collection in Reinforcement Learning from Human Feedback (RLHF). We propose a hybrid framework that integrates Bayesian preference inference with active learning, embedding an active querying mechanism into the standard RLHF pipeline. By leveraging Bayesian uncertainty estimates, the method dynamically selects the most informative preference pairs for annotation, balancing scalability with sample efficiency. Compared to conventional RLHF and Preference-Based Optimization (PBO) approaches, our framework reduces the number of required preference queries by approximately 40–60% while maintaining or even improving final policy performance on large language model fine-tuning and high-dimensional preference optimization tasks. The key contribution is the first systematic incorporation of Bayesian active learning into the RLHF preference acquisition stage, enabling joint optimization of scalability and query efficiency.
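The uncertainty-driven query selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Bradley–Terry preference model and approximates Bayesian uncertainty with a hypothetical ensemble of reward models (in practice one might use MC dropout or multiple reward heads). Pairs on which the ensemble's predicted preference probabilities disagree most are selected for human annotation.

```python
import numpy as np

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that item a is preferred over item b,
    given scalar reward estimates r_a and r_b."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def select_informative_pairs(reward_ensemble, pairs, k):
    """Rank candidate pairs by ensemble disagreement (variance of the
    predicted preference probability) and return the k most uncertain.

    reward_ensemble: list of callables mapping an item to a scalar reward
                     (a stand-in for samples from a Bayesian posterior).
    pairs: list of (item_a, item_b) candidate comparisons.
    """
    disagreement = []
    for a, b in pairs:
        probs = [preference_prob(rm(a), rm(b)) for rm in reward_ensemble]
        # High variance across ensemble members = high epistemic
        # uncertainty = an informative query for a human annotator.
        disagreement.append(np.var(probs))
    top = np.argsort(disagreement)[::-1][:k]
    return [pairs[i] for i in top]
```

In a full pipeline, the selected pairs would be sent to annotators, the resulting labels used to update the reward model posterior, and the loop repeated; the variance score here could be replaced by a mutual-information (BALD-style) acquisition function without changing the surrounding structure.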
📝 Abstract
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet collecting such preference data is often costly and time-consuming, motivating more efficient learning paradigms. Two established approaches offer complementary advantages: Reinforcement Learning from Human Feedback (RLHF) scales effectively to high-dimensional tasks such as LLM fine-tuning, while Preference-Based Optimization (PBO) achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, enabling active and sample-efficient preference collection. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.