🤖 AI Summary
This work addresses a computational redundancy in conventional diffusion models for image and video generation: early denoising steps run at full resolution even though the high-frequency components they process are still dominated by noise. The authors propose Spectral Progressive Diffusion, a framework that exploits the implicit frequency-domain autoregressive behavior of diffusion models by progressively increasing resolution along the denoising trajectory. It combines a spectral noise expansion mechanism with an optimal resolution schedule derived from the model's power spectrum, so that high-resolution computation is deferred until high-frequency detail actually begins to emerge. The method can accelerate existing pretrained diffusion models without retraining, and a novel fine-tuning recipe further improves both efficiency and quality. Experiments on state-of-the-art pretrained image and video generation models show significant inference speedups while preserving visual quality.
📝 Abstract
Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.
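To make the idea of growing resolution in the frequency domain more concrete, the following is a minimal NumPy sketch of one plausible way to move a noisy sample to a higher resolution: keep its existing low-frequency Fourier coefficients and fill the newly available high-frequency band with fresh Gaussian noise. This is an illustration under stated assumptions, not the paper's spectral noise expansion mechanism. The function name `spectral_expand`, the integer `scale` factor, and the noise level `sigma` are all hypothetical, and the sketch does not model how the noise level or timestep should be adjusted when the resolution changes.

```python
import numpy as np


def spectral_expand(x_low, scale, sigma, rng=None):
    """Upsample a 2-D array by an integer `scale` in the Fourier domain.

    Hypothetical illustration: the low-frequency coefficients of `x_low`
    are kept, and the new high-frequency band is filled with the spectrum
    of white Gaussian noise with standard deviation `sigma`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x_low.shape
    H, W = h * scale, w * scale

    # Spectrum of fresh white noise at the target resolution. It supplies
    # every coefficient that the low-resolution input cannot represent.
    noise = rng.normal(0.0, sigma, size=(H, W))
    spec = np.fft.fftshift(np.fft.fft2(noise))

    # Centered spectrum of the low-resolution input. NumPy's ifft2 divides
    # by the number of pixels, so rescale to keep pixel amplitudes unchanged.
    gain = (H * W) / (h * w)
    spec_low = np.fft.fftshift(np.fft.fft2(x_low)) * gain

    # Overwrite the central (low-frequency) block so that the DC terms align.
    top, left = H // 2 - h // 2, W // 2 - w // 2
    spec[top:top + h, left:left + w] = spec_low

    # For even sizes the Nyquist row/column of the block has no mirrored
    # partner, so the inverse transform is not exactly real; keep the real part.
    return np.fft.ifft2(np.fft.ifftshift(spec)).real


rng = np.random.default_rng(0)
x_low = rng.normal(size=(8, 8))
x_high = spectral_expand(x_low, scale=2, sigma=1.0, rng=rng)
print(x_high.shape)  # (16, 16)
```

In a progressive sampler, a step like this would sit between denoising stages, with early stages run on the small grid and later stages on the expanded one. When to switch is the role of the resolution schedule, which this sketch leaves out.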