🤖 AI Summary
To address the high cost and low efficiency of human preference annotation in large language model (LLM) alignment, this paper proposes an active learning framework that leverages the LLM itself as a parameterized, differentiable, non-linear reward model for sample selection, departing from the conventional linear reward assumption. The approach establishes a theoretically grounded, gradient-driven uncertainty estimation criterion and unifies implicit reward modeling, Direct Preference Optimization (DPO), and active sampling into an end-to-end optimization pipeline. Experiments across multiple LLMs and benchmark datasets demonstrate that the method achieves comparable or superior alignment performance using 30–50% fewer annotated preference pairs, significantly improving the efficiency of preference data utilization.
📝 Abstract
The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select data without considering the LLM being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
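The gradient-driven selection idea described above can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the `tanh` reward is a toy stand-in for the LLM-parameterized non-linear reward, the `beta` scale and the gradient-norm score are illustrative assumptions, and feature vectors replace real (prompt, response) pairs. It shows the general shape of scoring each unlabeled preference pair by the norm of its per-pair DPO-style loss gradient and sending the top-k most uncertain pairs for annotation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def reward(w, z):
    # Toy non-linear reward r(z; w) = tanh(w·z) -- a stand-in for the
    # implicit, LLM-parameterized reward in the paper.
    return np.tanh(w @ z)

def reward_grad(w, z):
    # d/dw tanh(w·z) = (1 - tanh(w·z)^2) * z
    return (1.0 - np.tanh(w @ z) ** 2) * z

def uncertainty(w, z_chosen, z_rejected, beta=0.1):
    # Hypothetical selection score: norm of the gradient of the per-pair
    # DPO-style log-loss  L = -log sigmoid(beta * (r(z_c) - r(z_r))).
    p = sigmoid(beta * (reward(w, z_chosen) - reward(w, z_rejected)))
    g = -(1.0 - p) * beta * (reward_grad(w, z_chosen) - reward_grad(w, z_rejected))
    return np.linalg.norm(g)

# Pool of unlabeled candidate pairs, represented here as feature vectors.
d, n, k = 8, 50, 5
w = rng.standard_normal(d)                                   # current model parameters
pool = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(n)]

scores = np.array([uncertainty(w, zc, zr) for zc, zr in pool])
selected = np.argsort(scores)[::-1][:k]  # top-k most uncertain pairs to annotate
print(selected)
```

After annotation, the selected pairs would be folded into the DPO training set and the model updated, and the scoring step repeated with the new parameters, which is what couples data selection to the LLM being aligned.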