🤖 AI Summary
To address the high computational cost and reinforcement-learning dependence of RLHF-based alignment for large language models (LLMs), this paper presents a systematic survey of Direct Preference Optimization (DPO), a reinforcement-learning-free alignment paradigm grounded solely in preference data. We introduce the first multidimensional taxonomy of DPO, unifying its theoretical foundations, algorithmic variants, benchmark datasets, and application domains. Through analysis grounded in Bradley–Terry modeling, loss-function characterization, and data-quality assessment, we synthesize over 120 works to identify DPO's convergence conditions, data-sensitivity patterns, and scenario-specific adaptation strategies. Crucially, we provide the first systematic account of its theoretical limitations, training biases, and generalization bottlenecks. Finally, we propose three key future directions: scalability enhancement, robustness improvement, and multimodal extension. Together, these provide a principled methodological foundation for efficient and stable alignment with human preferences.
📝 Abstract
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising alignment approach, offering an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Although DPO has seen numerous advancements and also carries inherent limitations, an in-depth review of these aspects is still lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO by key research questions to provide a thorough understanding of its current landscape. Finally, we propose several future research directions to offer the research community insights into model alignment.
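For context on the Bradley–Terry-based objective the survey analyzes, the following is a minimal, illustrative sketch of the per-pair DPO loss in pure Python. The function names, the scalar sequence log-probability inputs, and the default `beta` are assumptions made here for illustration; they are not code from the paper.

```python
import math

def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(beta * (chosen log-ratio - rejected log-ratio)).

    Inputs are (illustrative) total sequence log-probabilities under the
    trainable policy and a frozen reference model.
    """
    # Implicit rewards in DPO are the log-ratios of policy to reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) == softplus(-logits)
    return _softplus(-logits)

# When policy == reference, both log-ratios vanish and the loss is log 2.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # -> 0.6931
```

The loss decreases as the policy's margin for the chosen response over the rejected one grows relative to the reference, which is the gradient signal DPO trains on in place of an explicit reward model.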