🤖 AI Summary
This work addresses the difficulty lightweight diffusion models have in replicating the complex denoising dynamics of their teacher models during knowledge distillation, which often leads to training instability and degraded performance. To overcome this, the authors propose a coarse-to-fine distillation framework: an initial coarse alignment is achieved via Linear Fitting (LIFT), followed by spatially adaptive guidance through Piecewise Local Adaptive Coefficient Estimation (PLACE), which partitions outputs into error-based groups for localized refinement. The approach brings a two-stage distillation strategy with local adaptivity to diffusion model compression and is compatible with image and latent spaces, U-Net and DiT backbones, and both conditional and unconditional generation tasks. With a student of only 1.3M parameters (1.6% of the teacher model), the method achieves an FID of 15.73, substantially outperforming conventional knowledge distillation, whose FID often degrades to 50–200 or worse in this setting.
📝 Abstract
We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process, which stems from its substantially larger capacity, is difficult for the student model to mimic faithfully. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to fine refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), and datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, whereas our method converges stably and achieves an FID of 15.73.
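The abstract does not spell out the loss formulation, so the following is only a minimal NumPy sketch of one plausible reading of the coarse-to-fine idea, not the authors' implementation. It assumes that "coarse" alignment means matching the teacher up to a least-squares scalar fit `a * student + b`, that "fine" refinement is the ordinary MSE distillation loss, and that the piecewise variant fits separate coefficients within groups formed by sorting elements by absolute error. The function names, the scalar form of the fit, and the quantile-style grouping are all hypothetical.

```python
import numpy as np

def linear_fit(student, teacher):
    """Least-squares scalars (a, b) minimising ||a * student + b - teacher||^2."""
    s, t = student.ravel(), teacher.ravel()
    s_centered = s - s.mean()
    var = np.dot(s_centered, s_centered)
    # Degenerate case (constant student output): fall back to a pure offset.
    a = np.dot(s_centered, t - t.mean()) / var if var > 0 else 0.0
    b = t.mean() - a * s.mean()
    return a, b

def coarse_fine_losses(student, teacher):
    """Assumed decomposition: coarse = residual after the best linear fit,
    fine = plain MSE between student and teacher outputs."""
    a, b = linear_fit(student, teacher)
    coarse = np.mean((a * student + b - teacher) ** 2)
    fine = np.mean((student - teacher) ** 2)
    return coarse, fine

def piecewise_coarse_loss(student, teacher, num_groups=4):
    """Assumed piecewise variant: sort elements by absolute error, split them
    into num_groups groups, and fit separate coefficients in each group."""
    s, t = student.ravel(), teacher.ravel()
    order = np.argsort(np.abs(s - t))
    sse = 0.0
    for idx in np.array_split(order, num_groups):
        if idx.size == 0:
            continue  # more groups than elements
        a, b = linear_fit(s[idx], t[idx])
        sse += np.sum((a * s[idx] + b - t[idx]) ** 2)
    return sse / s.size

# Synthetic stand-ins for teacher and student noise predictions.
rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(2, 3, 8, 8))
student_out = 0.5 * teacher_out + 0.2 + 0.1 * rng.normal(size=teacher_out.shape)

coarse, fine = coarse_fine_losses(student_out, teacher_out)
piecewise = piecewise_coarse_loss(student_out, teacher_out, num_groups=4)
print(coarse, fine, piecewise)
```

Under these assumptions the coarse loss can never exceed the fine loss, because `a = 1, b = 0` is one admissible fit, and the piecewise loss can never exceed the global coarse loss, because each group is free to reuse the global coefficients. That ordering is consistent with the abstract's description of an easier target first and locally adaptive guidance afterwards. In actual training these quantities would be computed on framework tensors so that gradients reach the student.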