🤖 AI Summary
This work addresses the mode collapse and reduced sample diversity commonly induced by Distribution Matching Distillation (DMD) during accelerated generation. To mitigate this issue, the authors propose a role-disentangled distillation framework that separates the distillation process into distinct stages: the initial step preserves diversity through target prediction (e.g., v-prediction), while subsequent steps focus exclusively on enhancing generation quality. A gradient-blocking mechanism is introduced to prevent the first step from being adversely influenced by DMD optimization. Notably, the method achieves competitive visual fidelity in text-to-image generation without relying on discriminators, perceptual networks, or additional architectural components—demonstrating that simple role decoupling across distillation steps suffices to significantly improve sample diversity while matching the performance of current state-of-the-art approaches.
📝 Abstract
Distribution matching distillation (DMD) aligns a few-step generator with its multi-step teacher to enable high-quality generation at low inference cost. However, DMD tends to suffer from mode collapse, since its reverse-KL formulation inherently encourages mode-seeking behavior. Existing remedies typically rely on perceptual or adversarial regularization, incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of the distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD). Despite its simplicity -- no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images -- DP-DMD preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
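The role separation and gradient blocking can be illustrated with a deliberately toy sketch. Everything below is hypothetical and not from the paper: the two "steps" are scalar linear maps, the v-prediction target `2*z` and the DMD objective's teacher target `3*z` are stand-in quadratic losses, and the stop-gradient is simulated by treating the first step's output as a constant when computing the second step's gradient. The point is only the training structure: the v-prediction loss is the sole source of gradient for step 1, while the DMD-style loss updates step 2 alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-step "generator": each distilled step is a scalar weight.
w1, w2 = 0.5, 0.5   # step-1 (diversity-preserving) and step-2 (refinement) params
lr = 0.1

z = rng.standard_normal(256)   # input noise
v_target = 2.0 * z             # hypothetical v-prediction target for step 1

for _ in range(100):
    x1 = w1 * z
    # v-prediction loss L_v = mean((x1 - v_target)^2) trains ONLY step 1.
    g_w1 = np.mean(2.0 * (x1 - v_target) * z)

    # Stop-gradient: step 1's output enters the DMD-style loss as a constant,
    # so no gradient from that loss can flow back into w1.
    x1_sg = x1.copy()
    x2 = w2 * x1_sg
    # Stand-in for the DMD objective: pull step-2 output toward a "teacher" 3*z.
    g_w2 = np.mean(2.0 * (x2 - 3.0 * z) * x1_sg)

    w1 -= lr * g_w1   # only the v-prediction gradient reaches step 1
    w2 -= lr * g_w2   # only the DMD-style gradient reaches step 2

print(round(w1, 3), round(w2, 3))
```

With the gradient block in place, `w1` converges to the v-prediction optimum (2.0) untouched by the refinement objective, while `w2` settles where the composed output `w2*w1*z` matches the teacher target (here 1.5). In an autograd framework the `x1.copy()` line would be a `detach()`/`stop_gradient` on the first step's output.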