Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of aligning efficient few-step diffusion generators, obtained by distillation, with human preferences. The authors propose RTDMD, a two-stage framework built on a reward-tilted teacher distribution: minimizing the KL divergence between the student's generation distribution and this teacher decomposes into a distribution matching term and a reward maximization term. The first stage, AC-DMD, performs sub-interval distribution matching and adds a consistency regularizer to the fake score objective so that the fake score model can track the shifting generator distribution. The second stage optimizes both terms jointly, using a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate steps with direct reward backpropagation through the deterministic final step, together with SubGRPO (step-subset GRPO) to reduce policy gradient variance. Evaluated on SD3, SD3.5, and FLUX.2 with only four denoising steps, the method reports state-of-the-art results on human preference, aesthetic quality, and compositional metrics.

📝 Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
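
The decomposition stated in the abstract can be illustrated with a short derivation. This is a sketch under assumed notation, not necessarily the paper's exact formulation: $p_\theta$ is the student (generator) distribution, $p_{\mathrm{T}}$ the teacher distribution, $r$ the reward, $\beta > 0$ a temperature, and $Z$ the normalizing constant of the tilted teacher. Conditioning on the text prompt is omitted, and the reverse KL direction (student to teacher) is assumed, as is common in distribution matching distillation.

```latex
\begin{aligned}
p^{*}(x) &= \frac{1}{Z}\, p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_{\mathrm{T}}}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\[4pt]
D_{\mathrm{KL}}\!\big(p_\theta \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_\theta}\!\Big[\log p_\theta(x) - \log p_{\mathrm{T}}(x) - \tfrac{1}{\beta}\, r(x) + \log Z\Big] \\[4pt]
&= \underbrace{D_{\mathrm{KL}}\!\big(p_\theta \,\|\, p_{\mathrm{T}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\, \underbrace{\mathbb{E}_{x \sim p_\theta}\!\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{aligned}
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side amounts to matching the teacher distribution while maximizing the expected reward, with $\beta$ setting the trade-off between the two terms in this sketch.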

Problem

Research questions and friction points this paper is trying to address.

few-step generation

human preference alignment

image generation

reward-guided learning

distribution matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Tilted Distribution Matching

Few-step Generation

Distribution Matching Distillation