Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing DPO methods for preference alignment in text-to-image diffusion models suffer from bias in sigmoid-based probability estimation and insufficient diversity in offline preference data. To address these limitations, we propose Diffusion-DRO: an implicit ranking-learning framework that eliminates explicit reward modeling by directly translating pairwise preferences into denoising objectives, optimized via inverse reinforcement learning. Crucially, Diffusion-DRO introduces a hybrid sampling strategy that jointly leverages offline expert demonstrations and online policy-generated negative samples to enhance generalization. Extensive quantitative evaluation across multiple dimensions, corroborated by human user studies, demonstrates that our method significantly outperforms state-of-the-art approaches. Notably, it achieves marked improvements in generation quality under both complex and zero-shot prompts, while exhibiting superior training stability and robustness to distributional shifts.

📝 Abstract
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
Problem

Research questions and friction points this paper is trying to address.

Optimizing diffusion models using implicit user feedback
Overcoming non-linear estimation issues in preference learning
Integrating offline and online data for improved generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ranking-based optimization without reward model
Simplifies training into denoising formulation
Combines offline demonstrations with online negative samples
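The ranking-over-denoising idea listed above can be illustrated with a minimal sketch. This is not the paper's actual objective, only an assumption of the general shape: a hinge-style ranking loss that asks the model to denoise the preferred (offline expert) sample at least `margin` better than a policy-generated negative. All function names, the margin term, and the hinge form are hypothetical here.

```python
import numpy as np

def denoising_mse(eps_pred, eps_true):
    """Per-sample denoising error: mean squared noise-prediction
    error, averaged over all non-batch dimensions."""
    axes = tuple(range(1, eps_pred.ndim))
    return ((eps_pred - eps_true) ** 2).mean(axis=axes)

def ranking_denoising_loss(eps_pred_win, eps_pred_lose, eps_true, margin=0.1):
    """Hypothetical hinge-style ranking over denoising errors: the
    preferred (expert) sample should incur a lower noise-prediction
    error than the policy-generated negative by at least `margin`.
    No explicit reward model is involved; preference enters only
    through the ordering of the two denoising errors."""
    err_win = denoising_mse(eps_pred_win, eps_true)    # offline expert sample
    err_lose = denoising_mse(eps_pred_lose, eps_true)  # online negative sample
    return np.maximum(0.0, margin + err_win - err_lose).mean()

# Toy usage: a winner denoised perfectly and a clearly worse loser
# drive the hinge to zero; swapping the pair makes the loss positive.
rng = np.random.default_rng(0)
eps = rng.standard_normal((2, 4, 4))
loss_ok = ranking_denoising_loss(eps.copy(), eps + 1.0, eps)
loss_bad = ranking_denoising_loss(eps + 1.0, eps.copy(), eps)
```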