🤖 AI Summary
This work addresses the high cost and low efficiency of human preference data collection in Reinforcement Learning from Human Feedback (RLHF). We propose a hybrid framework that integrates Bayesian preference inference with active learning, embedding an active querying mechanism into the standard RLHF pipeline. By leveraging Bayesian uncertainty estimates, the method dynamically selects the most informative preference pairs for annotation, balancing scalability with sample efficiency. Compared to conventional RLHF and Preference-Based Optimization (PBO) approaches, our framework reduces the number of required preference queries by approximately 40–60% while maintaining or even improving final policy performance on large language model fine-tuning and high-dimensional preference optimization tasks. The key contribution is the first systematic incorporation of Bayesian active learning into the RLHF preference acquisition stage, enabling joint optimization of scalability and query efficiency.
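The uncertainty-driven query selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Bradley–Terry preference model and approximates Bayesian uncertainty with a hypothetical ensemble of reward models (in practice one might use MC dropout or multiple reward heads). Pairs on which the ensemble's predicted preference probabilities disagree most are selected for human annotation.

```python
import numpy as np

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that item a is preferred over item b,
    given scalar reward estimates r_a and r_b."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def select_informative_pairs(reward_ensemble, pairs, k):
    """Rank candidate pairs by ensemble disagreement (variance of the
    predicted preference probability) and return the k most uncertain.

    reward_ensemble: list of callables mapping an item to a scalar reward
                     (a stand-in for samples from a Bayesian posterior).
    pairs: list of (item_a, item_b) candidate comparisons.
    """
    disagreement = []
    for a, b in pairs:
        probs = [preference_prob(rm(a), rm(b)) for rm in reward_ensemble]
        # High variance across ensemble members = high epistemic
        # uncertainty = an informative query for a human annotator.
        disagreement.append(np.var(probs))
    top = np.argsort(disagreement)[::-1][:k]
    return [pairs[i] for i in top]
```

In a full pipeline, the selected pairs would be sent to annotators, the resulting labels used to update the reward model posterior, and the loop repeated; the variance score here could be replaced by a mutual-information (BALD-style) acquisition function without changing the surrounding structure.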
📝 Abstract
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet collecting such preference data is often costly and time-consuming, motivating more efficient learning paradigms. Two established approaches offer complementary advantages: Reinforcement Learning from Human Feedback (RLHF) scales effectively to high-dimensional tasks such as LLM fine-tuning, while Preference-Based Optimization (PBO) achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, enabling active and sample-efficient preference collection. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.