Reinforcement Learning from Human Feedback with Active Queries

📅 2024-02-14
🏛️ arXiv.org
📈 Citations: 12
Influential: 0
🤖 AI Summary
Human preference annotation in Reinforcement Learning from Human Feedback (RLHF) is prohibitively expensive. Method: This paper proposes a query-efficient active RLHF framework, modeling preference alignment as a contextual dueling bandit problem. The authors establish an instance-dependent regret bound and a query-complexity bound for their Active-query-based Proximal Policy Optimization (APPO) algorithm. They further design ADPO, a practical active-query variant of Direct Preference Optimization (DPO), which queries human labels only on examples where the model is uncertain. Results: Empirical evaluation shows that ADPO matches the performance of state-of-the-art DPO while issuing only about half as many human preference queries, substantially improving query efficiency for LLM alignment under tight data budgets.

📝 Abstract
Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half the queries for human preference, matches the performance of the state-of-the-art DPO method.
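The core active-query idea can be sketched in a few lines. Under the Bradley–Terry preference model, the policy's implicit reward margin between two responses induces a predicted preference probability; a label is requested from a human only when that probability is close to 0.5 (the model is uncertain). This is a minimal illustration of the general principle, not the paper's exact criterion; the function name and the `threshold` value are hypothetical.

```python
import math

def should_query(margin, threshold=0.15):
    """Decide whether to request a human preference label.

    margin: implicit reward difference r(y1) - r(y2) between two
    candidate responses (in a DPO-style setup this would be
    beta * (log-ratio of policy to reference for y1 minus y2)).
    Under the Bradley-Terry model, P(y1 preferred) = sigmoid(margin).
    We query a human only when this probability is near 0.5,
    i.e. when the current model cannot resolve the pair itself.
    Returns (query_flag, predicted_preference_probability).
    """
    p = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
    return abs(p - 0.5) < threshold, p

# Confidently ranked pairs are skipped; ambiguous pairs are queried.
skip, p_skip = should_query(3.0)    # large margin -> confident, no query
ask, p_ask = should_query(0.1)      # small margin -> uncertain, query
```

Pairs the model already ranks confidently can be trained on with the model's own pseudo-label (or dropped), which is where the ~50% query savings comes from.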
Problem

Research questions and friction points this paper is trying to address.

Reducing human-labeled data cost
Improving query efficiency in RLHF
Aligning LLMs with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active query RLHF
APPO algorithm
ADPO for LLMs