Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current diffusion models face two key bottlenecks in aligning with human preferences: (1) low optimization efficiency, because reward scoring requires gradient computation through costly multi-step denoising, and (2) reliance on offline-tuned reward models to reach the desired aesthetic quality. This paper introduces Direct-Align and SRPO, a framework for efficient online preference alignment over the entire diffusion trajectory. It combines a predefined noise prior for interpolation-based image recovery, text-conditioned differentiable rewards, and relative preference optimization, which together avoid over-optimization at late timesteps and allow rewards to be adjusted semantically online. Applied to fine-tuning FLUX.1.dev, the method improves human-evaluated realism and aesthetic quality by over 3x without continual offline reward calibration.

📝 Abstract
Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these approaches face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and thus restricts optimization to only a few diffusion steps; (2) they often require continual offline adaptation of reward models to achieve desired aesthetic qualities, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior in order to recover the original image from any timestep via interpolation, leveraging the property that diffusion states are interpolations between noise and target images; this also effectively avoids over-optimization at late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
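
The abstract's interpolation identity can be made concrete with a short sketch. It assumes the rectified-flow schedule x_t = (1 - t) * x0 + t * noise used by FLUX-style models; the paper's exact coefficients may differ, and `recover_image` is an illustrative helper, not the authors' code.

```python
import torch

def recover_image(x_t: torch.Tensor, noise_prior: torch.Tensor, t: float) -> torch.Tensor:
    # Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * noise.
    # Because the injected noise prior is predefined (known), x0 has a
    # closed-form solution at any timestep t < 1, so no further denoising
    # rollout (or gradient through it) is needed before reward scoring.
    return (x_t - t * noise_prior) / (1.0 - t)

# Illustrative usage with stand-in tensors.
x0 = torch.randn(1, 16, 64, 64)        # stand-in for a clean latent
noise = torch.randn_like(x0)           # predefined noise prior
t = 0.7
x_t = (1.0 - t) * x0 + t * noise       # interpolated diffusion state
x0_hat = recover_image(x_t, noise, t)  # recovered estimate for reward scoring
assert torch.allclose(x0_hat, x0, atol=1e-5)
```

Because the recovery is a single algebraic step, the reward gradient only has to flow through the few denoising steps actually taken, which is the efficiency argument made in the abstract.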
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost of multistep denoising reward scoring
Minimizing reliance on offline reward model fine-tuning
Improving diffusion model aesthetic quality and realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predefined noise prior for image recovery
Text-conditioned rewards for online adjustment (illustrated in the sketch after this list)
Optimized denoising and reward fine-tuning
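
As a rough illustration of the text-conditioned, relative reward behind SRPO, the following sketch scores one image under positively and negatively augmented prompts and uses the difference as the training signal. The `reward_model` callable, the augmentation strings, and the function name are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch

def semantic_relative_reward(reward_model, image: torch.Tensor, prompt: str,
                             pos_words: str = "photorealistic, natural lighting",
                             neg_words: str = "oversaturated, waxy skin") -> torch.Tensor:
    """Hypothetical sketch of a semantic relative preference signal.

    reward_model(image, text) -> scalar tensor is assumed to be a
    differentiable, text-conditioned scorer. Taking the difference between
    the positively and negatively augmented prompts cancels reward bias
    shared by both, which is the intuition for adjusting rewards online
    instead of fine-tuning the reward model offline.
    """
    r_pos = reward_model(image, f"{prompt}, {pos_words}")
    r_neg = reward_model(image, f"{prompt}, {neg_words}")
    return r_pos - r_neg
```

The returned scalar would then be maximized with respect to the image recovered from the diffusion trajectory, as in the Direct-Align sketch above.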
Authors

Xiangwei Shen (The Chinese University of Hong Kong, Shenzhen)
Zhimin Li (Vanderbilt University)
Zhantao Yang (Hunyuan, Tencent)
Shiyi Zhang (Tsinghua University)
Yingfang Zhang (Hunyuan, Tencent)
Donghao Li (Hunyuan, Tencent)
Chunyu Wang (Hunyuan, Tencent)
Qinglin Lu (Hunyuan, Tencent)
Yansong Tang (Shenzhen International Graduate School, Tsinghua University)