🤖 AI Summary
To address training instability arising from timestep dependence and off-policy bias in direct preference optimization (DPO) for diffusion models, this paper proposes an importance-sampling-driven robust DPO framework. It is the first to systematically integrate importance sampling into diffusion preference optimization, theoretically eliminating the bias induced by the mismatch between the policy and data distributions. The authors introduce a timestep-aware gradient clipping and masking strategy (DPO-C&M) to suppress gradient variance during early, high-noise timesteps, and design an end-to-end distribution-correction framework, SDPO, which jointly performs reverse-trajectory modeling and adaptive importance weighting. Experiments on CogVideoX-2B/5B and Wan2.1-1.3B demonstrate that the method significantly improves VBench scores and human preference win rates, converges more stably during training, and comprehensively outperforms Diffusion-DPO.
📝 Abstract
Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: (1) timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance at early, noisy timesteps; and (2) off-policy bias, arising from the mismatch between the optimization and data-collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
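To make the two ingredients concrete, the sketch below combines them in a single loss: an importance weight (clipped, to bound gradient variance) that reweights the DPO preference margin toward on-policy behavior, and a timestep mask that zeroes out early, high-noise timesteps, in the spirit of DPO-C&M. This is a minimal NumPy illustration under assumed inputs, not the paper's implementation; the function name `sdpo_loss`, the inputs (`delta_win`/`delta_lose` as per-sample denoising-error gaps, `log_ratio` as the policy/data log-density ratio), and all hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sdpo_loss(delta_win, delta_lose, log_ratio, t,
              beta=0.1, clip=5.0, t_mask=0.8, T=1000):
    """Hedged sketch of an importance-weighted, timestep-masked DPO loss.

    delta_win / delta_lose : per-sample (reference minus policy) denoising-error
        gaps for the preferred / rejected generation, shape [B]  (assumed input).
    log_ratio : log(pi_theta / pi_data) at each sampled timestep, used as an
        importance weight to correct off-policy bias  (assumed input).
    t : sampled diffusion timesteps, shape [B]; timesteps above t_mask * T
        (early, high-noise region) are masked out, mimicking DPO-C&M.
    """
    # Importance weight, clipped on both sides to bound gradient variance.
    w = np.clip(np.exp(log_ratio), 1.0 / clip, clip)
    # Mask uninformative early (high-noise) timesteps.
    mask = (t < t_mask * T).astype(float)
    # Standard DPO logit on the preference margin.
    margin = beta * (delta_win - delta_lose)
    per_sample = -np.log(sigmoid(margin) + 1e-8)
    # Average only over the unmasked timesteps.
    weighted = w * mask * per_sample
    return weighted.sum() / (mask.sum() + 1e-8)
```

Note the division by the mask sum rather than the batch size: without it, raising the mask threshold would silently shrink the loss magnitude instead of just restricting which timesteps contribute.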