DPO-Shift: Shifting the Distribution of Direct Preference Optimization

๐Ÿ“… 2025-02-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
DPO and its variants are widely adopted for aligning language models with human preferences, yet they suffer from "likelihood displacement": the generation probability of chosen responses often decreases during training, undermining model stability and alignment quality. This work provides a theoretical analysis quantifying the fundamental trade-off between increasing the likelihood of chosen responses and preserving the reward margin. Building on this insight, the authors propose DPO-Shift, a parameter-controllable mechanism within the DPO framework that shifts the distribution of the chosen probability. The method incurs no additional model parameters, human annotations, or computational overhead, and integrates seamlessly with standard preference-based training pipelines. Empirical evaluation on MT-Bench and a pairwise win-rate experiment demonstrates consistent improvements over baseline DPO, effectively mitigating the decay of the chosen probability.

๐Ÿ“ Abstract
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
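The abstract describes shifting the distribution of the chosen probability within the standard DPO pairwise loss. Below is a minimal sketch of how such a shift could look, assuming (as an illustration, not as the paper's exact formulation) that a controllable scaling factor `f_lambda` is applied to the rejected-response reward term; the function name and default values are hypothetical:

```python
import math

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f_lambda=0.75):
    """Illustrative DPO-style pairwise loss with a controllable shift.

    Rewards are the usual log-probability ratios against the reference
    policy, scaled by beta. The hypothetical factor f_lambda reweights
    the rejected-response term; f_lambda = 1.0 recovers standard DPO.
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Shifted reward margin fed into the log-sigmoid, as in DPO.
    margin = reward_chosen - f_lambda * reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With `f_lambda = 1.0` this reduces to the standard DPO objective, so the parameter interpolates between vanilla DPO and a loss that puts less weight on pushing down the rejected response, which is the kind of controllable trade-off the abstract describes.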
Problem

Research questions and friction points this paper is trying to address.

Address likelihood displacement in DPO
Control the distribution of the chosen probability
Improve model alignment with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controllable shifting of the chosen-probability distribution
Mitigation of likelihood displacement
Analysis of the trade-off between improving the chosen probability and sacrificing the reward margin
Xiliang Yang
PhD student, Nanyang Technological University, CCDS
Bayesian inference, differential privacy, preference optimization, optimization
Feng Jiang
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Qianen Zhang
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Lei Zhao
Institute of Translational Medicine and National Center for Translational Medicine, Shanghai Jiao Tong University
Xiao Li
School of Data Science, The Chinese University of Hong Kong, Shenzhen