🤖 AI Summary
Human preference annotation in Reinforcement Learning from Human Feedback (RLHF) is prohibitively expensive.
Method: This paper proposes a query-efficient active-learning framework for RLHF, formalizing preference alignment as a contextual dueling bandit problem. The authors design an active-query-based proximal policy optimization (APPO) algorithm and establish an instance-dependent regret bound together with a query complexity bound. They further propose ADPO, a practical variant based on direct preference optimization (DPO) that combines active query selection with DPO-style fine-tuning of LLMs.
Results: Empirical evaluation shows that ADPO matches the performance of the state-of-the-art DPO method while using only about half of the human preference queries, substantially improving query efficiency without sacrificing alignment quality. This points toward more practical LLM alignment under tight annotation budgets.
📝 Abstract
Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half the queries for human preference, matches the performance of the state-of-the-art DPO method.
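To make the active-query idea concrete, below is a minimal PyTorch-style sketch of restricting a DPO-style loss to actively queried preference pairs. The uncertainty heuristic (query a human label only when the model's implicit preference probability is near 0.5), the `query_oracle` callable, and the threshold value are illustrative assumptions for this sketch, not the paper's exact ADPO criterion.

```python
# Hypothetical sketch of an active-query DPO-style update (not the authors' exact
# ADPO rule): a human preference label is requested only for pairs the policy is
# uncertain about, and the DPO loss is computed on those queried pairs alone.
import torch
import torch.nn.functional as F


def dpo_margin(policy_logps_a, policy_logps_b,
               ref_logps_a, ref_logps_b, beta=0.1):
    """Implicit reward margin: beta * [(log pi - log pi_ref)_a - (log pi - log pi_ref)_b]."""
    return beta * ((policy_logps_a - ref_logps_a) - (policy_logps_b - ref_logps_b))


def active_dpo_loss(policy_logps_a, policy_logps_b,
                    ref_logps_a, ref_logps_b,
                    query_oracle, beta=0.1, uncertainty_threshold=0.2):
    """Query the preference oracle only when the model is uncertain.

    `query_oracle(idx)` is an assumed callable returning +1 if response a is
    preferred over b for example idx, else -1. The uncertainty heuristic here
    (|p - 0.5| below a threshold) is an illustration only.
    """
    margin = dpo_margin(policy_logps_a, policy_logps_b,
                        ref_logps_a, ref_logps_b, beta)
    pref_prob = torch.sigmoid(margin)                    # model's belief that a beats b
    uncertain = (pref_prob - 0.5).abs() < uncertainty_threshold
    if not uncertain.any():
        return margin.new_zeros(()), 0                   # no labels queried this batch

    idx = uncertain.nonzero(as_tuple=True)[0]
    labels = torch.tensor([query_oracle(i.item()) for i in idx],
                          dtype=margin.dtype, device=margin.device)
    # Standard DPO objective on the queried subset: -log sigmoid(label * margin).
    loss = -F.logsigmoid(labels * margin[idx]).mean()
    return loss, idx.numel()
```

In a full training loop, the queried labels would typically be accumulated into a growing preference dataset and the loss back-propagated through the policy model only, with the reference model kept frozen.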