Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address pixel-space noise sensitivity, reliance on external vision-language models (VLMs), and low training efficiency in diffusion model preference optimization, this paper proposes Latent Preference Optimization (LPO). LPO repurposes the diffusion U-Net backbone as a noise-aware Latent Reward Model (LRM), enabling direct preference modeling over noisy latent states and timestep-conditioned reward prediction with latent-space gradient updates. By operating entirely in the latent space, LPO avoids pixel-space noise interference and removes the dependency on external VLMs. It achieves state-of-the-art performance across multiple benchmarks in general preference modeling, image aesthetic quality, and image-text alignment, while also improving training efficiency by 2.5×–28× over prior methods, demonstrating both effectiveness and scalability.
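Reward models of this kind are typically trained with a pairwise Bradley-Terry-style preference loss; the summary does not spell out the exact objective, so the sketch below is only an illustration of that standard formulation, with hypothetical scalar rewards standing in for the LRM's timestep-conditioned outputs:

```python
import math

def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_w - r_l).

    The loss shrinks as the reward model scores the preferred
    sample increasingly above the rejected one.
    """
    diff = reward_preferred - reward_rejected
    # Numerically stable form of -log(sigmoid(diff)) = log(1 + exp(-diff)).
    return math.log1p(math.exp(-diff))

# A correctly ordered pair with a clear margin incurs a small loss:
print(round(preference_loss(2.0, 0.0), 4))  # → 0.1269
```

In step-level training, such a loss would be applied to pairs of noisy latents at the same timestep, so the reward arguments would come from the noise-aware LRM rather than being raw scalars as here.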

📝 Abstract
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves a 2.5–28× training speedup compared to existing preference optimization methods. Our code will be available at https://github.com/casiatao/LPO.
Problem

Research questions and friction points this paper is trying to address.

Aesthetic Optimization
Diffusion Models
Noise Resilience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Preference Optimization
Diffusion Models
Training Efficiency
Tao Zhang
MAIS, CASIA, School of Artificial Intelligence, UCAS
Cheng Da
Alibaba Group
deep learning, hashing, OCR
Kun Ding
CASIA
CV, Multimodal
Kun Jin
Kuaishou Technology
Yan Li
Kuaishou Technology
Tingting Gao
Kuaishou Technology
Di Zhang
Kuaishou Technology
Shiming Xiang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Distance Metric Learning, Semi-supervised Learning, Manifold Learning, Regression, Feature Selection
Chunhong Pan
MAIS, CASIA