🤖 AI Summary
This work addresses the trade-off between speed and generation quality in few-step diffusion distillation. Existing approaches that combine reinforcement learning (RL) with distillation often rely on unreliable sample-level rewards, which misaligns the RL objective with the distillation objective. The authors propose GDMD, a framework that treats the gradients from Distribution Matching Distillation (DMD) as implicit target tensors and uses them to construct gradient-level reward signals. Existing reward models can then evaluate the quality of distillation updates directly, rather than scoring raw pixel outputs. This gradient-level guidance acts as an adaptive weighting that keeps the RL policy synchronized with the distillation objective during training. Experiments show that the resulting 4-step generators surpass their multi-step teacher and previous DMDR results on GenEval and human-preference evaluations, setting a new state of the art in few-step generation and showing strong potential for scaling.
📝 Abstract
Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of the two objectives relies on scoring raw samples, which is suboptimal. Sample-based scoring inherently conflicts with the distillation trajectory and produces unreliable rewards because early-stage generations are noisy. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by using distillation gradients, rather than raw pixel outputs, as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new state of the art (SOTA) for few-step generation. Specifically, our 4-step models surpass their multi-step teacher in quality and substantially exceed previous DMDR results on GenEval and human-preference metrics, exhibiting strong scalability potential.
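The idea of scoring an implicit target rather than a raw sample can be illustrated with a toy sketch. It is not the authors' implementation, and the abstract does not specify these details. It assumes the DMD update direction is the difference between the fake and real denoiser predictions (normalization omitted), that the implicit target is the sample moved one step against that direction, and that the "adaptive weighting" is a softmax over per-sample rewards. The helper names and `toy_reward` are hypothetical stand-ins, and vectors are plain Python lists in place of image tensors.

```python
import math

def dmd_implicit_target(x, pred_real, pred_fake):
    """Implicit target x - grad, with grad = pred_fake - pred_real (assumed form)."""
    grad = [pf - pr for pf, pr in zip(pred_fake, pred_real)]
    return [xi - gi for xi, gi in zip(x, grad)]

def toy_reward(sample):
    """Hypothetical reward model: higher when the sample is closer to all-ones."""
    return -sum((s - 1.0) ** 2 for s in sample) / len(sample)

def softmax_weights(rewards, temperature=1.0):
    """Assumed weighting: softmax over rewards, rescaled so the mean weight is 1."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total * len(rewards) for e in exps]

def weighted_distillation_loss(batch_x, batch_targets, weights):
    """Reward-weighted 0.5 * MSE between each sample and its fixed target."""
    loss = 0.0
    for x, t, w in zip(batch_x, batch_targets, weights):
        mse = sum((xi - ti) ** 2 for xi, ti in zip(x, t)) / len(x)
        loss += w * 0.5 * mse
    return loss / len(batch_x)

# Toy batch of two "images" with made-up denoiser predictions.
batch_x = [[0.0, 0.0], [0.0, 0.0]]
pred_real = [[1.0, 1.0], [0.2, 0.2]]
pred_fake = [[0.5, 0.5], [0.5, 0.5]]

targets = [dmd_implicit_target(x, r, f)
           for x, r, f in zip(batch_x, pred_real, pred_fake)]
rewards = [toy_reward(t) for t in targets]  # score the targets, not the raw samples
weights = softmax_weights(rewards)
loss = weighted_distillation_loss(batch_x, targets, weights)
print(targets, weights, loss)
```

In this sketch, the update whose target scores higher under the reward receives a larger weight, so the reward rescales the distillation step instead of adding a separate objective. In a real training loop the targets would be detached from the computation graph so that gradients flow only through the generator output.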