🤖 AI Summary
Existing open-source pyramid-based video diffusion models are typically trained from scratch, which makes it difficult to balance generation quality with inference efficiency and often yields suboptimal visual plausibility. This work proposes a low-cost fine-tuning approach that, for the first time, losslessly converts a pretrained video diffusion model into a pyramidal architecture, thereby avoiding the quality degradation associated with training from scratch. By combining a multi-resolution pyramid structure with a systematic step-distillation strategy, the method substantially improves inference efficiency without compromising generation quality. Experiments show that the resulting model matches state-of-the-art methods in visual plausibility, confirming the effectiveness and practicality of the proposed paradigm.
📝 Abstract
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degrading the quality of the output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
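To make the pyramidal scheme concrete, the sketch below illustrates the general idea described in the abstract: early denoising steps (high noise) run at a low resolution, and the latent is upsampled between stages so that later, low-noise steps run at progressively higher resolutions. This is a minimal illustration only; all names (`denoise_step`, `upsample`, `pyramidal_sample`) and the specific schedule are hypothetical assumptions, not the paper's actual implementation.

```python
import numpy as np

def denoise_step(x, noise_level):
    # Stand-in for one denoiser evaluation of a diffusion model.
    # A real model would predict and remove noise; here we just
    # attenuate the signal proportionally to the noise level.
    return x * (1.0 - 0.1 * noise_level)

def upsample(x):
    # Nearest-neighbor 2x spatial upsampling between pyramid stages.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def pyramidal_sample(base_res=16, num_stages=3, steps_per_stage=4, seed=0):
    # Stage 0 runs at the lowest resolution with the highest noise;
    # each subsequent stage upsamples and denoises at lower noise.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((base_res, base_res))
    total_steps = num_stages * steps_per_stage
    for stage in range(num_stages):
        for step in range(steps_per_stage):
            # Noise level decreases monotonically across the pyramid.
            noise_level = 1.0 - (stage * steps_per_stage + step) / total_steps
            x = denoise_step(x, noise_level)
        if stage < num_stages - 1:
            x = upsample(x)
    return x

sample = pyramidal_sample()
print(sample.shape)  # final resolution is base_res * 2**(num_stages - 1)
```

The efficiency gain comes from the fact that most denoising steps operate on small latents, so only the final, low-noise stage pays the full-resolution cost.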