🤖 AI Summary
Existing video diffusion and flow models rely on multi-step sampling, incurring high inference overhead; meanwhile, prior distillation methods, which focus solely on trajectory preservation or on distribution matching, suffer from artifacts and performance degradation under few-step settings. This paper proposes SwiftVideo, a unified and stable few-step video generation distillation framework. It jointly optimizes trajectory preservation and distribution matching via continuous-time consistency distillation and dual-perspective alignment, ensuring fidelity to the ODE solution trajectory while enhancing consistency with the data distribution. The method is compatible with both diffusion and flow models and supports distillation down to arbitrarily few steps. On the OpenVid-1M benchmark, SwiftVideo significantly outperforms state-of-the-art methods using only 4–8 sampling steps, achieving gains of +2.1 in PSNR and +0.03 in SSIM while reducing artifacts by 37%, jointly improving generation quality and inference efficiency.
📝 Abstract
Diffusion-based and flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods based solely on trajectory preservation or distribution matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose ***SwiftVideo***, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.
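To make the two objectives concrete, here is a minimal toy sketch in NumPy of the general idea behind combining a consistency (trajectory-preservation) term with a distribution-alignment term. It uses a 1-D linear ODE with a closed-form solution in place of a real video model, and every name and weighting below (`student`, `lambda_dist`, the moment-matching proxy) is an illustrative assumption, not the paper's actual loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear flow ODE dx/dt = -x, whose trajectories are x(t) = x(0)*exp(-t).
# A real model would predict the velocity with a large network; here it is
# known in closed form so the distillation targets are exact.

def teacher_velocity(x, t):
    return -x

def euler_step(x, t, dt):
    # One solver step along the teacher's ODE trajectory.
    return x + teacher_velocity(x, t) * dt

def student(x, t):
    # "Distilled" one-step generator: maps any point (x_t, t) on a
    # trajectory straight back to the trajectory's start x(0).
    return x * np.exp(t)

# Trajectory preservation (consistency): the student's output should agree
# at adjacent points of a single teacher trajectory, so the penalty below
# vanishes as the step size dt -> 0 (the continuous-time limit).
x0 = rng.normal(size=1000)                 # stand-in for clean data
t, dt = 0.5, 0.01
x_t = x0 * np.exp(-t)                      # exact point on the trajectory
x_next = euler_step(x_t, t, dt)            # one solver step further
consistency = np.mean((student(x_t, t) - student(x_next, t + dt)) ** 2)

# Distribution alignment: a crude two-moment proxy comparing one-step
# generations (from fresh noise at t = 1) against the "real" data x0.
gen = student(rng.normal(size=1000) * np.exp(-1.0), 1.0)
dist_align = (gen.mean() - x0.mean()) ** 2 + (gen.std() - x0.std()) ** 2

lambda_dist = 0.1                          # assumed weighting, for illustration
loss = consistency + lambda_dist * dist_align
print(consistency, dist_align, loss)
```

Because the toy student is exact, the consistency term is tiny (limited only by the Euler solver's O(dt²) discretization error) and the distribution term is limited only by sampling noise; in actual training both terms would be minimized over the student's parameters.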