DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing acceleration methods for video diffusion models, such as feature caching and step distillation, often suffer semantic and fine-detail degradation under aggressive compression, and quality deteriorates further when the two techniques are naively combined. To address this, the authors propose a learnable feature caching mechanism that is compatible with distillation, replacing conventional heuristic caching strategies with a lightweight neural predictor. Crucially, the work introduces the first co-design of feature caching and step distillation, incorporating a conservative Restricted MeanFlow distillation strategy to enable stable, high-ratio acceleration with minimal quality loss. Evaluated on large-scale video diffusion Transformers, the method achieves an 11.8× speedup while preserving generation fidelity, substantially outperforming current state-of-the-art approaches.
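The learnable-caching idea can be sketched as follows. The paper's actual predictor architecture, cache schedule, and update rule are not given on this page, so the `FeaturePredictor` MLP, its dimensions, the fixed `cache_interval` policy, and the toy denoising update below are all illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Lightweight MLP that predicts a block's output feature at the
    current step from a feature cached at an earlier step.
    Architecture and sizes are illustrative, not from the paper."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # feature + scalar timestep gap
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, cached_feat: torch.Tensor, dt: torch.Tensor) -> torch.Tensor:
        dt = dt.expand(cached_feat.shape[0], 1)
        # Residual correction on top of the cached feature.
        return cached_feat + self.net(torch.cat([cached_feat, dt], dim=-1))

def denoise(block: nn.Module, predictor: FeaturePredictor,
            x: torch.Tensor, steps: int, cache_interval: int = 2) -> torch.Tensor:
    """Run `steps` denoising steps; recompute the heavy block only every
    `cache_interval` steps and predict its output cheaply otherwise."""
    cache, t_cached = None, 0.0
    for i in range(steps):
        t = i / steps
        if cache is None or i % cache_interval == 0:
            feat = block(x)              # expensive full computation
            cache, t_cached = feat, t
        else:
            dt = torch.tensor([[t - t_cached]])
            feat = predictor(cache, dt)  # cheap learned approximation
        x = x + 0.1 * feat               # toy update rule (illustrative)
    return x

dim = 64
x = torch.randn(2, dim)
block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
out = denoise(block, FeaturePredictor(dim), x, steps=8)
print(out.shape)  # torch.Size([2, 64])
```

In practice the predictor would be trained to match the full block's outputs along real sampling trajectories; the point of making it learnable is that it can track the high-dimensional feature evolution more accurately than a fixed reuse-or-interpolate heuristic.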

📝 Abstract
While diffusion models have achieved great success in video generation, this progress comes with a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular for its training-free property and considerable speedup, but it inevitably suffers semantic and detail degradation under further compression. Another widely adopted method, training-aware step distillation, though successful in image generation, also degrades drastically in video generation at few sampling steps. Moreover, the quality loss becomes more severe when training-free feature caching is naively applied to step-distilled models, because the sampling steps are sparser. This paper introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling more accurate modeling of the high-dimensional feature evolution process. Furthermore, we examine the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. Through these initiatives, we push the acceleration boundary to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code will be made publicly available soon.
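As background, standard MeanFlow distillation (on which the paper's Restricted variant builds; the specific restriction is not described on this page) trains a few-step generator by parameterizing the average velocity over a time interval rather than the instantaneous velocity. Writing $v(z_t, t)$ for the instantaneous flow velocity, the average velocity over $[r, t]$ is

$$u(z_t, r, t) = \frac{1}{t - r}\int_r^t v(z_\tau, \tau)\,d\tau,$$

and differentiating with respect to $t$ yields the MeanFlow identity used to construct the training target:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt}\,u(z_t, r, t).$$

A "conservative" or restricted variant would constrain how aggressively this target is applied (e.g., limiting the interval width or the update magnitude); the exact formulation is left to the paper itself.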
Problem

Research questions and friction points this paper is trying to address.

video diffusion
feature caching
step distillation
computational acceleration
quality degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature Caching
Step Distillation
Video Diffusion Models
Learnable Predictor
Restricted MeanFlow