Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing DPO methods for preference alignment in text-to-image diffusion models suffer from bias in sigmoid-based probability estimation and insufficient diversity in offline preference data. To address these limitations, we propose Diffusion-DRO: an implicit ranking-learning framework that eliminates explicit reward modeling by directly translating pairwise preferences into denoising objectives, optimized via inverse reinforcement learning. Crucially, Diffusion-DRO introduces a hybrid sampling strategy that jointly leverages offline expert demonstrations and online policy-generated negative samples to enhance generalization. Extensive quantitative evaluation across multiple dimensions, corroborated by human user studies, demonstrates that our method significantly outperforms state-of-the-art approaches. Notably, it achieves marked improvements in generation quality under both complex and zero-shot prompts, while exhibiting superior training stability and robustness to distributional shifts.

📝 Abstract
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
Problem

Research questions and friction points this paper is trying to address.

Optimizing diffusion models using implicit user feedback
Overcoming non-linear estimation issues in preference learning
Integrating offline and online data for improved generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ranking-based optimization without reward model
Simplifies training into denoising formulation
Combines offline demonstrations with online negative samples
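The ranking-over-denoising idea listed above can be illustrated with a minimal sketch. This is not the paper's actual objective, only an assumption of the general shape: a hinge-style ranking loss that asks the model to denoise the preferred (offline expert) sample at least `margin` better than a policy-generated negative. All function names, the margin term, and the hinge form are hypothetical here.

```python
import numpy as np

def denoising_mse(eps_pred, eps_true):
    """Per-sample denoising error: mean squared noise-prediction
    error, averaged over all non-batch dimensions."""
    axes = tuple(range(1, eps_pred.ndim))
    return ((eps_pred - eps_true) ** 2).mean(axis=axes)

def ranking_denoising_loss(eps_pred_win, eps_pred_lose, eps_true, margin=0.1):
    """Hypothetical hinge-style ranking over denoising errors: the
    preferred (expert) sample should incur a lower noise-prediction
    error than the policy-generated negative by at least `margin`.
    No explicit reward model is involved; preference enters only
    through the ordering of the two denoising errors."""
    err_win = denoising_mse(eps_pred_win, eps_true)    # offline expert sample
    err_lose = denoising_mse(eps_pred_lose, eps_true)  # online negative sample
    return np.maximum(0.0, margin + err_win - err_lose).mean()

# Toy usage: a winner denoised perfectly and a clearly worse loser
# drive the hinge to zero; swapping the pair makes the loss positive.
rng = np.random.default_rng(0)
eps = rng.standard_normal((2, 4, 4))
loss_ok = ranking_denoising_loss(eps.copy(), eps + 1.0, eps)
loss_bad = ranking_denoising_loss(eps + 1.0, eps.copy(), eps)
```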