🤖 AI Summary
To address the high computational cost and reinforcement-learning dependence of RLHF-based alignment for large language models (LLMs), this paper presents a systematic survey of Direct Preference Optimization (DPO), a reinforcement-learning-free alignment paradigm grounded solely in preference data. We introduce the first multidimensional taxonomy of DPO, unifying its theoretical foundations, algorithmic variants, benchmark datasets, and application domains. Through analysis grounded in Bradley–Terry modeling, loss-function characterization, and data-quality assessment, we synthesize over 120 works to identify DPO's convergence conditions, data-sensitivity patterns, and scenario-specific adaptation strategies. Crucially, we provide the first systematic account of its theoretical limitations, training biases, and generalization bottlenecks. Finally, we propose three key future directions: scalability enhancement, robustness improvement, and multimodal extension. Together, these provide a principled methodological foundation for efficient and stable alignment with human preferences.
📝 Abstract
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising alignment approach, offering an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Although DPO has seen numerous advancements and also carries inherent limitations, an in-depth review of these aspects is still lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO by key research questions to provide a thorough understanding of its current landscape. Finally, we propose several future research directions to offer the research community insights into model alignment.
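For context on the Bradley–Terry-based objective the survey analyzes, the following is a minimal, illustrative sketch of the per-pair DPO loss in pure Python. The function names, the scalar sequence log-probability inputs, and the default `beta` are assumptions made here for illustration; they are not code from the paper.

```python
import math

def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(beta * (chosen log-ratio - rejected log-ratio)).

    Inputs are (illustrative) total sequence log-probabilities under the
    trainable policy and a frozen reference model.
    """
    # Implicit rewards in DPO are the log-ratios of policy to reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) == softplus(-logits)
    return _softplus(-logits)

# When policy == reference, both log-ratios vanish and the loss is log 2.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # -> 0.6931
```

The loss decreases as the policy's margin for the chosen response over the rejected one grows relative to the reference, which is the gradient signal DPO trains on in place of an explicit reward model.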