Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of image fidelity that arises in one-step text-to-image models when terminal reward optimization is mismatched with the underlying generative dynamics. The authors propose Diff-Instruct with Diffused Reward (DIDR), a data-free, trajectory-level alignment framework that, for the first time, applies Integral KL minimization along the diffusion trajectory. The formulation yields a principled reward propagation mechanism: the Diffused Reward Score (DRS), together with the Diffused Reward Proxy (DRP), an efficient estimator of DRS. Together they make one-step reinforcement learning alignment both theoretically grounded and practical. In experiments, the method Pareto-dominates existing one-step SDXL baselines and, on a 6B DiT backbone (Z-Image), surpasses the preference alignment of its 50-step teacher while using only a single generation step.

📝 Abstract

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
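The objects named in the abstract can be sketched in standard notation. The symbols below (the temperature \beta, the reward r, the forward kernel q_t, the weighting w(t), and the exact form of each term) are assumptions chosen for illustration and are not taken from the paper; in particular, reading the last term of the score decomposition as the Diffused Reward Score is an interpretation, not the authors' stated definition.

```latex
% Assumed notation: p_ref is the reference clean-image distribution,
% r the reward, beta > 0 a KL-regularization temperature,
% q_t(x_t | x_0) the forward diffusion kernel at noise level t.

% RLHF-optimal reward-tilted clean-image distribution:
p^{*}(x_0) = \frac{1}{Z}\, p_{\mathrm{ref}}(x_0)\, \exp\!\big(r(x_0)/\beta\big)

% Propagating it to noise level t with the forward kernel:
p^{*}_{t}(x_t) = \int q_t(x_t \mid x_0)\, p^{*}(x_0)\, \mathrm{d}x_0
             = \frac{1}{Z}\, p_{\mathrm{ref},t}(x_t)\,
               \mathbb{E}_{x_0 \sim p_{\mathrm{ref}}(\cdot \mid x_t)}
               \big[\exp\!\big(r(x_0)/\beta\big)\big]

% Taking the gradient of the log gives the reference score plus a
% reward-driven correction (plausibly what the abstract calls the DRS):
\nabla_{x_t} \log p^{*}_{t}(x_t)
  = \nabla_{x_t} \log p_{\mathrm{ref},t}(x_t)
  + \nabla_{x_t} \log \mathbb{E}_{x_0 \sim p_{\mathrm{ref}}(\cdot \mid x_t)}
    \big[\exp\!\big(r(x_0)/\beta\big)\big]

% An Integral KL objective over the trajectory, with generator marginals
% p_{\theta,t} and an assumed positive weighting w(t):
\mathcal{L}(\theta) = \int_{0}^{T} w(t)\,
  \mathrm{KL}\big(p_{\theta,t} \,\|\, p^{*}_{t}\big)\, \mathrm{d}t
```

Under these assumptions, if the generator's clean-image distribution equals p^{*}, every noised marginal matches and each KL term is zero, which is consistent with the abstract's claim that the trajectory-level objective shares its minimizer with clean-image RLHF. The conditional expectation over p_ref(x_0 | x_t) has no closed form in general, which may be why a proxy based on short-step denoising is needed.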

Problem

Research questions and friction points this paper is trying to address.

one-step generator

reinforcement learning

diffusion model

reward optimization

image fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffused Reward

One-step Generation

Trajectory-level Alignment