🤖 AI Summary
Existing diffusion distillation methods are costly to train, and the supplementary losses they rely on have drawbacks: regression losses require pre-generating large datasets and cap the student at the teacher's performance, while GAN-based losses are unstable and need careful tuning. This work proposes Embedding Loss (EL), which aligns the student generator with the real data distribution by computing Maximum Mean Discrepancy (MMD) in the feature spaces of a diverse set of randomly initialized, untrained networks. EL improves both training efficiency and generation quality for few-step generators, including single-step ones. It is compatible with several distillation frameworks, including DMD, DI, and CM. On CIFAR-10 it achieves state-of-the-art FID values of 1.475 (unconditional) and 1.380 (conditional), and it reduces training iterations by up to 80%. It also consistently outperforms existing single-step distillation methods on ImageNet, AFHQ-v2, and FFHQ.
📝 Abstract
Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
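The core idea of the abstract, computing MMD between real and generated samples in the feature spaces of several randomly initialized networks, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the two-layer ReLU embedders, the RBF kernel, the fixed bandwidth, the biased MMD estimator, and all function names (`random_embedder`, `mmd2`, `embedding_loss`) are assumptions chosen for illustration, and toy Gaussian vectors stand in for images.

```python
import numpy as np

def random_embedder(in_dim, out_dim, rng):
    """Build a fixed, untrained two-layer ReLU network (hypothetical architecture)."""
    w1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(out_dim), size=(out_dim, out_dim))

    def embed(x):
        # Weights are never updated; the network only supplies random features.
        return np.maximum(x @ w1, 0.0) @ w2

    return embed

def rbf_kernel(a, b, bandwidth):
    """Pairwise RBF kernel between rows of a and rows of b."""
    sq_dist = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def embedding_loss(real, fake, embedders, bandwidth=4.0):
    """Average squared MMD over the feature spaces of all random embedders."""
    return float(np.mean([mmd2(f(real), f(fake), bandwidth) for f in embedders]))

rng = np.random.default_rng(0)
embedders = [random_embedder(32, 16, rng) for _ in range(4)]

real = rng.normal(0.0, 1.0, size=(256, 32))       # stand-in for data samples
good_fake = rng.normal(0.0, 1.0, size=(256, 32))  # generator matching the data
bad_fake = rng.normal(2.0, 1.0, size=(256, 32))   # generator with a shifted mean

loss_good = embedding_loss(real, good_fake, embedders)
loss_bad = embedding_loss(real, bad_fake, embedders)
print(loss_good, loss_bad)
```

In this toy setup the loss is small when the generated samples follow the data distribution and larger when they do not, which is the signal a supplementary loss of this kind would add to a distillation objective. In an actual training loop the same quantity would be computed in an autodiff framework so that gradients flow back to the generator.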