A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

📅 2024-10-21
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address the high computational cost and reliance on reinforcement learning in RLHF-based alignment of large language models (LLMs), this paper presents a systematic survey of Direct Preference Optimization (DPO)—a reinforcement-learning-free alignment paradigm grounded solely in preference data. We introduce the first multidimensional taxonomy of DPO, unifying its theoretical foundations, algorithmic variants, benchmark datasets, and application domains. Through rigorous analysis grounded in Bradley–Terry modeling, loss function characterization, and data quality assessment, we empirically synthesize over 120 works to identify DPO’s convergence conditions, data sensitivity patterns, and scenario-specific adaptation strategies. Crucially, we uncover its fundamental theoretical limitations, training biases, and generalization bottlenecks for the first time. Finally, we propose three key future directions: scalability enhancement, robustness improvement, and multimodal extension—providing a principled methodological foundation for efficient, stable human preference alignment.
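The survey's analysis centers on the standard DPO objective from Rafailov et al., which scores each preference pair by the policy-vs-reference log-probability margin between the chosen and rejected responses. As a minimal illustration (the function name and scalar per-pair interface are my own; real implementations operate on batched sequence log-probabilities):

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a summed log-probability of a response under the
    policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits); minimized when the policy upweights the
    # chosen response relative to the rejected one more than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

At initialization (policy equals reference) every margin is zero and the loss is log 2; it falls as the policy widens the chosen-over-rejected gap, which is the convergence behavior the survey's theoretical sections examine.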

📝 Abstract
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with human preferences efficiently
Reviewing DPO's theories, variants, and limitations
Exploring future directions for model alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

DPO as RL-free human preference alignment
Comprehensive review of DPO challenges
Future research directions for DPO
👥 Authors
Wenyi Xiao, Zhejiang University
Zechuan Wang, Zhejiang University, China
Leilei Gan, Zhejiang University (NLP · LLMs · Multimodal LLMs · AI+X)
Shuai Zhao, Nanyang Technological University, Singapore
Wanggui He, Researcher, Alibaba Group
Anh Tuan Luu, Nanyang Technological University, Singapore
Long Chen, Alibaba Group, China
Hao Jiang, Alibaba Group, China
Zhou Zhao, Zhejiang University (Machine Learning · Data Mining · Multimedia Computing)
Fei Wu, Zhejiang University, China