Active Preference Optimization for Sample Efficient RLHF

📅 2024-02-16
📈 Citations: 20
Influential: 5
🤖 AI Summary
In Reinforcement Learning from Human Feedback (RLHF), high annotation costs and suboptimal policy learning under random sampling with limited preference data severely hinder efficient large language model alignment. Method: This paper reformulates RLHF as a contextual preference bandit problem, where prompts serve as contexts. The authors propose Active Preference Optimization (APO), a framework that adaptively selects the most informative prompt-generation pairs for human annotation. APO is built on the Bradley–Terry–Luce (BTL) preference model and comes with theoretical guarantees on convergence and sample efficiency. Contribution/Results: The paper proves that APO achieves an $O(1/\sqrt{T})$ convergence rate and rigorously establishes that random sampling incurs a constant suboptimality gap. Empirical evaluation on real-world preference datasets demonstrates that APO significantly outperforms existing methods under constrained annotation budgets, enabling cost-effective, sample-efficient alignment of large models.
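The core idea, selecting the prompt-generation pair whose preference outcome is most uncertain under a BTL model, can be sketched in a few lines. This is a minimal illustration of uncertainty-driven selection, not the paper's exact acquisition rule; the function names and the scalar `reward_fn` interface are assumptions for the example.

```python
import math

def btl_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry-Luce probability that response A is preferred over B,
    given scalar reward estimates r_a and r_b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def preference_uncertainty(r_a: float, r_b: float) -> float:
    """Binary entropy of the BTL preference probability: largest when the
    current model is least sure which response an annotator would prefer."""
    p = btl_preference_prob(r_a, r_b)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_most_informative(pairs, reward_fn):
    """From a pool of (generation_a, generation_b) candidates, pick the pair
    whose preference label is most uncertain -- a simple stand-in for an
    active acquisition rule that spends the annotation budget where feedback
    is most informative."""
    return max(
        pairs,
        key=lambda pair: preference_uncertainty(reward_fn(pair[0]),
                                                reward_fn(pair[1])),
    )
```

Under this rule, pairs whose estimated rewards are nearly equal (preference probability near 0.5) are queried first, while pairs with a clear winner are skipped, which is the intuition behind spending a limited budget $T$ on the most informative comparisons.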

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. Although aligned generative models have shown remarkable abilities in various tasks, their reliance on high-quality human preference data creates a costly bottleneck in the practical application of RLHF. One primary reason is that current methods collect human feedback by uniformly picking prompt-generation pairs from a dataset of prompt-generations, resulting in sub-optimal alignment under a constrained budget, which highlights the criticality of adaptive strategies in efficient alignment. Recent works [Mehta et al., 2023, Muldrew et al., 2024] have tried to address this problem by designing various heuristics based on generation uncertainty. However, either the assumptions in [Mehta et al., 2023] are restrictive, or [Muldrew et al., 2024] do not provide any rigorous theoretical guarantee. To address these, we reformulate RLHF within the contextual preference bandit framework, treating prompts as contexts, and develop an active-learning algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), which enhances model alignment by querying preference data from the most important samples, achieving superior performance for small sample budgets. We analyze the theoretical performance guarantees of $\texttt{APO}$ under the BTL preference model, showing that the suboptimality gap of the policy learned via $\texttt{APO}$ scales as $O(1/\sqrt{T})$ for a budget of $T$. We also show that collecting preference data by choosing prompts randomly leads to a policy that suffers a constant sub-optimality. We perform detailed experimental evaluations on practical preference datasets to validate $\texttt{APO}$'s efficacy over the existing methods, establishing it as a sample-efficient and practical solution for alignment in a cost-effective and scalable manner.
Problem

Research questions and friction points this paper is trying to address.

High-cost human preference collection in RLHF for LLMs
Uniform prompt sampling leads to sub-optimal model policies
Need for adaptive sampling to align efficiently with limited preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Preference Optimization for RLHF
Adaptive context sampling strategy
Sub-optimality gap characterization