SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current supervised fine-tuning (SFT) of text-to-image diffusion models optimizes only a pixel-level MSE loss, which fails to ensure global perceptual quality and structural consistency. To address this, we propose Self-supervised Direct Preference Optimization (SUDO), a novel paradigm that (i) introduces an annotation-free, self-supervised mechanism for generating preference image pairs, capturing both local details and global semantics, and (ii) integrates Direct Preference Optimization (DPO) into the diffusion training pipeline, optimizing the DPO loss together with the pixel-wise MSE under text-conditioned sampling. Evaluated on Stable Diffusion 1.5 and XL, SUDO achieves significantly lower FID scores alongside consistent improvements in CLIP Score and human evaluation metrics. The method enhances structural coherence and fine-grained realism simultaneously, without requiring human-annotated preferences.
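The summary does not spell out the combined objective, so the sketch below is an assumption: it follows the standard Diffusion-DPO formulation (a logistic loss on reference-relative denoising errors of the preferred vs. rejected image) and adds the pixel-level MSE term with a hypothetical weight `lam`. The function names, `beta`, and `lam` are illustrative, not the paper's exact choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=5000.0):
    """Diffusion-DPO objective computed from per-sample denoising MSEs.

    err_w / err_l: current model's denoising error on the preferred
    ("win") and rejected ("lose") images; ref_err_*: the same errors
    under the frozen pre-trained reference model.
    """
    # Negative margin => model improves on the win sample relative to
    # the reference more than on the lose sample => lower loss.
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    return -np.log(sigmoid(-beta * margin))

def sudo_style_loss(err_w, err_l, ref_err_w, ref_err_l,
                    beta=5000.0, lam=1.0):
    """Hypothetical combined objective: DPO term plus the usual
    pixel-level MSE on the preferred sample, weighted by lam."""
    dpo = diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta)
    return dpo + lam * err_w
```

At initialization, where the model equals the reference, the DPO term reduces to -log(sigmoid(0)) = log 2, and the gradient then pushes the model to lower its error on the preferred sample faster than on the rejected one.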

📝 Abstract
Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes the mean squared error (MSE) loss at the pixel level, neglecting the need for global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the need for costly data collection and annotation efforts typically associated with traditional direct preference optimization methods. Through extensive experiments on widely used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The code is available at [this link](https://github.com/SPengLiang/SUDO).
Problem

Research questions and friction points this paper is trying to address.

Enhances global image quality in diffusion models
Optimizes pixel-level and image-level details jointly
Reduces need for costly data annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised direct preference optimization for diffusion models
Generates preference image pairs without costly annotation
Optimizes both pixel-level and global image quality