🤖 AI Summary
This paper identifies three intrinsic deficiencies of Direct Preference Optimization (DPO) in large language model alignment: a drastic drop in the likelihood of rejected responses, degradation into outright response suppression, and a dispersion effect on unseen responses, collectively termed the 3D properties. These arise from the interaction between chosen- and rejected-response gradients, which induces optimization instability. The authors develop a gradient-dynamics analysis of the 3D properties, linking this instability to performance degradation, and design lightweight, reward-model-free regularization techniques grounded in that analysis. Experiments on a controlled toy model and on real-world LLM tasks (mathematical problem-solving and instruction following) show that the regularization improves training stability, alleviates response suppression, strengthens out-of-distribution robustness, and narrows the performance gap between DPO and reward-model-based methods.
📝 Abstract
Aligning large language models (LLMs) with human preferences has gained significant attention, with Proximal Policy Optimization (PPO) as a standard yet computationally expensive method and Direct Preference Optimization (DPO) as a more efficient alternative. While DPO offers simplicity, it remains underutilized in state-of-the-art LLMs, suggesting potential limitations. In this work, we revisit DPO, analyzing its theoretical foundations and empirical performance to bridge this gap. We identify three key properties, termed 3D properties, that emerge from DPO's learning process: Drastic drop in rejected response likelihood, Degradation into response suppression, and Dispersion effect on unseen responses. We show that these issues arise from DPO's optimization dynamics, where the interaction between chosen and rejected response gradients leads to instability. Our findings are supported by experiments on both a controlled toy model and real-world LLM tasks, including mathematical problem-solving and instruction following. To address these challenges, we propose simple regularization techniques that improve training stability and performance. Additionally, we examine how preference data distribution impacts DPO's effectiveness, offering insights into how alignment models handle out-of-domain (OOD) data. Our work connects these observations to broader research and provides a theoretical explanation for DPO's limitations. We hope these insights will guide future advancements in reward-model-free preference learning, bringing it closer to reward-model-based approaches.
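To make the optimization dynamics discussed above concrete, here is a minimal sketch of the per-pair DPO loss, plus an illustrative regularized variant. The paper proposes "simple regularization techniques" without specifying them in the abstract; the SFT-style anchor term below (`lam * -logp_chosen`, a hypothetical choice) is one common way to keep the chosen response's likelihood from being dragged down alongside the rejected one, not necessarily the authors' exact method.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen/rejected
    responses under the policy; ref_logp_* are the same quantities
    under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written in a numerically explicit form
    return math.log(1.0 + math.exp(-margin))

def dpo_loss_with_anchor(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         beta=0.1, lam=0.05):
    """DPO plus a hypothetical SFT-style anchor on the chosen response.

    The extra term -lam * logp_chosen penalizes low chosen-response
    likelihood, counteracting the suppression ("Degradation") effect.
    """
    base = dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta)
    return base - lam * logp_chosen
```

With equal implicit rewards the margin is zero and the loss is log 2; widening the chosen-vs-rejected margin lowers it. Note that the base loss depends only on the margin, so it can be reduced by pushing the rejected likelihood down without raising the chosen one, which is the failure mode the anchor term guards against.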