🤖 AI Summary
Existing acceleration methods for video diffusion models—such as feature caching and step distillation—often suffer from semantic and fine-detail degradation during compression, with quality deteriorating significantly when these techniques are combined. To address this, this work proposes a learnable feature caching mechanism compatible with distillation, replacing conventional heuristic strategies with a lightweight neural predictor. Crucially, it introduces the first co-design of feature caching and step distillation, incorporating a conservative Restricted MeanFlow distillation strategy to enable stable, high-ratio acceleration with minimal quality loss. Evaluated on large-scale video diffusion Transformers, the method achieves an 11.8× speedup while preserving generation fidelity, substantially outperforming current state-of-the-art approaches.
📝 Abstract
While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular for being training-free and delivering considerable speedups, but it inevitably suffers semantic and fine-detail degradation as compression increases. Another widely adopted method, training-aware step distillation, though successful in image generation, degrades drastically in video generation when reduced to only a few steps. Worse, the quality loss becomes more severe when training-free feature caching is naively applied to step-distilled models, because the sampling steps are sparser. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor in place of traditional training-free heuristics, enabling more accurate modeling of the high-dimensional feature evolution process. Furthermore, we examine the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach for more stable, near-lossless distillation. Together, these techniques push the acceleration boundary to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code will be made publicly available soon.
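To make the caching idea concrete, here is a minimal sketch of the difference between heuristic feature caching (reuse the last computed feature at a skipped step) and a learnable predictor that extrapolates the feature trajectory. The class name, the linear form of the predictor, and the coefficients are illustrative assumptions, not the paper's architecture; the paper's predictor is a trained lightweight neural network.

```python
import numpy as np

class LearnableFeatureCache:
    """Illustrative sketch (not the paper's implementation): a tiny
    predictor extrapolates the feature at a skipped diffusion step from
    the two most recently computed features, rather than simply reusing
    the last one as heuristic caching does."""

    def __init__(self, w=(2.0, -1.0)):
        # In the paper these would be learned parameters of a lightweight
        # neural predictor; here they are fixed linear-extrapolation
        # coefficients, assumed purely for demonstration.
        self.w = w
        self.prev = None    # feature computed at step t-1
        self.prev2 = None   # feature computed at step t-2

    def update(self, feat):
        """Record a feature actually computed by the diffusion model."""
        self.prev2, self.prev = self.prev, feat

    def predict(self):
        """Estimate the feature at the next (skipped) step."""
        if self.prev2 is None:
            # Heuristic fallback: reuse the last feature unchanged.
            return self.prev
        # Linear extrapolation along the feature trajectory.
        return self.w[0] * self.prev + self.w[1] * self.prev2

cache = LearnableFeatureCache()
cache.update(np.zeros(4))   # feature at step t-2
cache.update(np.ones(4))    # feature at step t-1
skipped = cache.predict()   # extrapolated feature for the skipped step
```

For a feature trajectory that happens to evolve linearly, this toy predictor recovers the skipped step exactly, whereas plain reuse of the last feature would lag behind; the paper's point is that a trained predictor can track the actual high-dimensional, nonlinear feature evolution far better than such fixed heuristics.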