Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models still struggle to align with human preferences: existing diffusion-based direct preference optimization (DPO) methods widen the preference margin but concurrently amplify the reconstruction errors of both the preferred and dispreferred branches, degrading generation quality. To address this, we propose Diffusion-SDPO, a safeguarded gradient-update mechanism. Through a first-order analysis, we derive a closed-form scaling coefficient that adaptively suppresses gradients from dispreferred samples, ensuring the reconstruction error of preferred outputs does not increase during optimization. The method is fully compatible with mainstream preference-learning frameworks and incurs negligible computational overhead. Extensive experiments on standard benchmarks demonstrate consistent improvements over baselines on automated preference scores, aesthetic quality, and prompt-alignment metrics.
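To make the first-order argument concrete, the display below reconstructs one way the closed-form coefficient can be derived; the notation (e_w, e_l, g_w, g_l, λ) and the exact form of the coefficient are our reconstruction from the description above, not copied from the paper.

```latex
% One possible reconstruction of the first-order safeguard argument.
% Notation (e_w, e_l, g_w, g_l, \lambda) is assumed, inferred from the summary;
% the paper's exact coefficient may differ. The positive DPO sigmoid weight
% is absorbed into the step size \eta > 0.
%
% Let e_w(\theta), e_l(\theta) be the winner/loser reconstruction errors, with
% g_w = \nabla_\theta e_w, g_l = \nabla_\theta e_l, and a safeguarded step
% \Delta\theta = -\eta\,(g_w - \lambda g_l). To first order,
\[
\Delta e_w \approx \langle g_w, \Delta\theta \rangle
  = -\eta\left(\lVert g_w\rVert^2 - \lambda\,\langle g_w, g_l\rangle\right).
\]
% Requiring \Delta e_w \le 0 with \lambda \in [0, 1] gives the closed form
\[
\lambda^\star =
\begin{cases}
  1, & \langle g_w, g_l\rangle \le \lVert g_w\rVert^2,\\
  \dfrac{\lVert g_w\rVert^2}{\langle g_w, g_l\rangle}, & \text{otherwise,}
\end{cases}
\]
% so the loser gradient is damped exactly when its projection onto the winner
% gradient is large enough to raise the winner's error.
```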

📝 Abstract
Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, and broadly compatible with existing DPO-style alignment frameworks, and it adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt-alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
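As a concrete illustration, here is a minimal PyTorch sketch of the safeguarded gradient combination described in the abstract, assuming flattened per-branch gradients; the function name `safeguarded_direction`, its signature, and the coefficient mirror the first-order reconstruction above rather than the paper's released code.

```python
import torch

def safeguarded_direction(g_w: torch.Tensor, g_l: torch.Tensor,
                          eps: float = 1e-12) -> torch.Tensor:
    """Combine winner/loser gradients so that, to first order, the winner's
    reconstruction error cannot increase. Illustrative reconstruction only.

    g_w: flattened gradient of the preferred (winner) error term.
    g_l: flattened gradient of the dispreferred (loser) error term.
    Returns the update direction g_w - lam * g_l (to be descended along).
    """
    dot = torch.dot(g_w, g_l)   # alignment of the loser with the winner gradient
    sq = torch.dot(g_w, g_w)    # squared norm of the winner gradient
    # Suppress the loser gradient only when it would raise the winner's error.
    lam = torch.where(dot <= sq, torch.ones_like(dot), sq / (dot + eps))
    return g_w - lam * g_l
```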
Problem

Research questions and friction points this paper is trying to address.

Optimizing human preference alignment in text-to-image diffusion models
Addressing reconstruction quality degradation during preference optimization
Developing safeguarded updates to prevent adverse effects on preferred outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safeguarded update rule that adaptively scales the loser gradient (see the training-step sketch below)
Closed-form scaling coefficient that keeps the winner's error non-increasing
Model-agnostic method that adds minimal computational overhead
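As referenced in the list above, here is a hypothetical sketch of how the safeguard could slot into a DPO-style training step; `model`, the `batch` keys, and the omission of the sigmoid weighting and the frozen reference model are all simplifications for illustration, not the paper's implementation.

```python
import torch

def sdpo_step(model, optimizer, batch):
    """One safeguarded preference-optimization step (illustrative sketch).
    `batch` is assumed to hold noised winner/loser latents, their noise
    targets, conditioning, and timesteps; the DPO sigmoid weighting and the
    frozen reference model are omitted for brevity.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-branch diffusion reconstruction errors.
    e_w = (model(batch["x_w"], batch["t"], batch["cond"]) - batch["eps_w"]).pow(2).mean()
    e_l = (model(batch["x_l"], batch["t"], batch["cond"]) - batch["eps_l"]).pow(2).mean()

    # Per-branch gradients; the two forward passes have separate graphs.
    g_w = torch.autograd.grad(e_w, params, retain_graph=True)
    g_l = torch.autograd.grad(e_l, params)

    flat_w = torch.cat([g.flatten() for g in g_w])
    flat_l = torch.cat([g.flatten() for g in g_l])
    dot, sq = torch.dot(flat_w, flat_l), torch.dot(flat_w, flat_w)
    lam = 1.0 if dot <= sq else (sq / dot).item()   # closed-form safeguard

    optimizer.zero_grad()
    for p, gw, gl in zip(params, g_w, g_l):
        p.grad = gw - lam * gl    # push winner error down, loser error up
    optimizer.step()
```

The design point is that the safeguard is a scalar computed from two inner products, so it composes with any optimizer and any DPO-style objective at the cost of one extra backward pass.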
👥 Authors
Minghao Fu
School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Alibaba International Digital Commerce Group
Guo-Hua Wang
Alibaba
Machine Learning, Deep Learning
Tianyu Cui
Alibaba International Digital Commerce Group
Qing-Guo Chen
Alibaba
Machine Learning
Zhao Xu
Alibaba International Digital Commerce Group
Weihua Luo
Alibaba
Natural Language Processing, Machine Learning, Artificial Intelligence
Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Two-sided markets, Internet platforms, e-commerce