Aligning Diffusion Models with Noise-Conditioned Perception

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
Diffusion models optimized in pixel or VAE latent spaces suffer from misalignment with human perceptual judgments, making preference alignment training slow and inefficient. To address this, we transfer the alignment objective into the U-Net's noise-conditioned embedding space, introducing noise-conditioned perceptual objectives and jointly optimizing DPO, CPO, and SFT within this space. Our method significantly improves modeling of human visual preferences—including aesthetic appeal and prompt adherence—while reducing computational overhead. On the PartiPrompts benchmark, the SDXL variant achieves 60.8% general preference accuracy, 62.2% visual appeal score, and 52.1% prompt-following rate, outperforming the baseline SDXL-DPO across all metrics. Moreover, it is compatible with diverse optimization paradigms. This work establishes a novel, efficient, and perceptually consistent alignment framework for diffusion models.

📝 Abstract
Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8% general preference, 62.2% visual appeal, and 52.1% prompt following against original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: https://huggingface.co/alexgambashidze/SDXL_NCP-DPO_v0.1
Problem

Research questions and friction points this paper is trying to address.

Aligns diffusion models with human perception using perceptual objectives
Optimizes training efficiency and quality in preference alignment stage
Enhances visual appeal and prompt following in text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning diffusion models in U-Net embedding space
Using perceptual objective for human preference alignment
Applying DPO, CPO, SFT to enhance efficiency and quality
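
The core idea—a DPO-style preference loss computed on distances in an embedding space rather than in pixel/VAE latent space—can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the function name, the squared-L2 distance (standing in for the perceptual distance on U-Net noise-conditioned features), and the toy vectors are all assumptions for exposition.

```python
import numpy as np

def dpo_embedding_loss(e_w_theta, e_l_theta,
                       e_w_ref, e_l_ref,
                       e_w_tgt, e_l_tgt, beta=0.1):
    """DPO-style preference loss over embedding-space errors.

    e_*_theta: policy-model embeddings for the preferred (w) and
               dispreferred (l) samples.
    e_*_ref:   frozen reference-model embeddings for the same samples.
    e_*_tgt:   embeddings of the ground-truth targets.
    Squared L2 here is an illustrative stand-in for a perceptual distance.
    """
    # Per-sample error = squared distance to the target in embedding space
    err = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
    # Implicit reward margin: the policy should reduce the error on the
    # preferred sample relative to the reference, and/or increase it on
    # the dispreferred sample.
    margin = (err(e_w_ref, e_w_tgt) - err(e_w_theta, e_w_tgt)) \
           - (err(e_l_ref, e_l_tgt) - err(e_l_theta, e_l_tgt))
    # Bradley-Terry / logistic preference loss, as in DPO
    return float(-np.log(1.0 / (1.0 + np.exp(-beta * margin))))

# Toy usage: a policy closer to the preferred target gets a lower loss.
rng = np.random.default_rng(0)
d = 16
tgt_w, tgt_l = rng.normal(size=d), rng.normal(size=d)
ref_w, ref_l = tgt_w + 0.5, tgt_l + 0.5   # reference equally off on both
good = dpo_embedding_loss(tgt_w + 0.1, tgt_l + 0.5, ref_w, ref_l, tgt_w, tgt_l)
bad  = dpo_embedding_loss(tgt_w + 0.9, tgt_l + 0.5, ref_w, ref_l, tgt_w, tgt_l)
```

Swapping the distance for one measured on U-Net noise-conditioned features is what moves this loss from latent space into the perceptual space the paper advocates; the same margin structure also accommodates CPO or SFT terms.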