🤖 AI Summary
To reduce the inference cost of multi-step denoising in diffusion models, this paper proposes Scale-wise Distillation (SwD), a distillation framework for few-step generators. Rather than running every step at full resolution, SwD starts generation at a low resolution and upscales the sample at each denoising step (next-scale prediction), so that total inference time approaches that of two full-resolution denoising steps. Its key contributions are: (1) a scale-wise distillation paradigm that integrates into existing distribution-matching distillation methods; (2) a patch-level loss that enforces finer-grained similarity between the generated and target distributions; and (3) a design motivated by the view of diffusion processes as implicit spectral autoregression, in which low-frequency content can be generated before high-frequency detail. On text-to-image generation, SwD significantly outperforms existing distillation methods under the same computational budget, achieving lower FID and higher CLIP-Score, and human preference studies further confirm its advantage in perceptual quality.
📝 Abstract
We present SwD, a scale-wise distillation framework for diffusion models (DMs) that applies next-scale prediction ideas to diffusion-based few-step generators. SwD is inspired by recent insights relating diffusion processes to implicit spectral autoregression. We hypothesize that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance, while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. We also enrich the family of distribution matching approaches by introducing a novel patch loss that enforces finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms its counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.
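The scale-wise sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scale schedule, the noise levels, the nearest-neighbour upsampler, and `toy_denoiser` (a placeholder for the distilled generator) are all assumptions made for illustration, and `relative_cost` assumes that the cost of one step is proportional to the number of pixels, which is only a rough approximation for real networks.

```python
import numpy as np


def upsample_nearest(x, size):
    """Nearest-neighbour upsample of an (H, W, C) array to (size, size, C)."""
    h, w = x.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]


def toy_denoiser(x_noisy, sigma):
    """Hypothetical stand-in for the distilled generator (simple shrinkage)."""
    return x_noisy / (1.0 + sigma ** 2)


def scale_wise_sample(scales, sigmas, denoiser=toy_denoiser, channels=3, seed=0):
    """Run one denoising step per scale, upscaling between steps.

    scales: increasing spatial sizes, the last one being full resolution.
    sigmas: decreasing noise levels, one per scale (assumed schedule).
    """
    rng = np.random.default_rng(seed)
    # Start from pure noise at the lowest resolution.
    x = sigmas[0] * rng.standard_normal((scales[0], scales[0], channels))
    for i, sigma in enumerate(sigmas):
        x0 = denoiser(x, sigma)  # predict a clean sample at the current scale
        if i + 1 == len(scales):
            return x0
        # Upscale the prediction, then re-noise it to the next noise level.
        x0_up = upsample_nearest(x0, scales[i + 1])
        x = x0_up + sigmas[i + 1] * rng.standard_normal(x0_up.shape)


def relative_cost(scales):
    """Total cost in units of one full-resolution step, assuming the cost
    of a step is proportional to its pixel count."""
    full = scales[-1] ** 2
    return sum(s * s for s in scales) / full


scales = [32, 64, 96, 128]       # illustrative schedule, not from the paper
sigmas = [1.0, 0.7, 0.4, 0.1]    # illustrative noise levels
image = scale_wise_sample(scales, sigmas)
print(image.shape, relative_cost(scales))
```

Under the pixel-count assumption, this four-step schedule costs 1.875 full-resolution steps, which shows how several scale-wise steps can fit within a budget of roughly two full-resolution steps.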