🤖 AI Summary
Transformer-based video diffusion models (VDMs) suffer from quadratic computational complexity in self-attention, making long-sequence and high-resolution video generation prohibitively expensive.
Method: We propose the first retraining-free linearization framework for pretrained VDMs, combining a hybrid softmax/linear attention mechanism, lightweight knowledge distillation and fine-tuning, and a cost-aware block-rate scheduling strategy, enabling efficient adaptation without altering the original model architecture.
Contribution/Results: Evaluated on the Wan2.1 1.3B model, our approach reduces attention FLOPs by up to 40% while preserving generation quality on the VBench and VBench-2.0 benchmarks. To our knowledge, this is the first work to achieve competitive sub-quadratic attention in pretrained video diffusion Transformers without retraining from scratch, balancing representational capacity and inference efficiency. The framework establishes a practical path toward scalable video generation.
📝 Abstract
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce *Attention Surgery*, an efficient framework for *linearizing* or *hybridizing* attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, mixing softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic-attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
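To make the complexity trade-off concrete, here is a minimal NumPy sketch contrasting softmax attention (O(n²d)) with kernelized linear attention (O(nd²)), plus a hypothetical hybrid that routes some query tokens through softmax and the rest through linear attention. The token split (`n_softmax`) and the elu+1 feature map are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax_attn(Q, K, V):
    # Standard softmax attention: O(n^2 * d) in sequence length n.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def linear_attn(Q, K, V):
    # Kernelized linear attention: phi(Q) (phi(K)^T V), O(n * d^2).
    # phi(x) = elu(x) + 1, a common strictly positive feature map.
    phi = lambda x: np.exp(np.minimum(x, 0.0)) + np.maximum(x, 0.0)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (d, d) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)       # per-query normalizer
    return (Qp @ KV) / Z[:, None]

def hybrid_attn(Q, K, V, n_softmax):
    # Hypothetical hybrid: the first n_softmax query tokens attend with
    # softmax over the full sequence; the rest use linear attention.
    out = np.empty_like(V)
    out[:n_softmax] = softmax_attn(Q[:n_softmax], K, V)
    out[n_softmax:] = linear_attn(Q[n_softmax:], K, V)
    return out
```

Setting `n_softmax = n` recovers full softmax attention, while `n_softmax = 0` is fully linear; intermediate values trade expressiveness for cost, which is the dial the block-rate strategy tunes per layer.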