🤖 AI Summary
Transformer-based video diffusion models (VDMs) suffer from quadratic computational complexity in self-attention, making long-sequence and high-resolution video generation prohibitively expensive.
Method: We propose the first retraining-free linearization framework for pretrained VDMs, combining a hybrid softmax/linear attention mechanism, lightweight knowledge distillation and fine-tuning, and a cost-aware block-rate scheduling strategy, enabling efficient adaptation without altering the original model architecture.
Contribution/Results: Evaluated on the Wan2.1 1.3B model, our approach reduces attention FLOPs by up to 40% while preserving generation quality on the VBench and VBench-2.0 benchmarks. To our knowledge, this is the first work to achieve competitive sub-quadratic attention in pretrained video diffusion Transformers without retraining from scratch, balancing representational capacity and inference efficiency. The framework establishes a practical path toward scalable video generation.
📝 Abstract
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce *Attention Surgery*, an efficient framework for *linearizing* or *hybridizing* attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, mixing softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic-attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
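To make the complexity trade-off concrete, here is a minimal NumPy sketch contrasting softmax attention (O(n²d)) with kernelized linear attention (O(nd²)), plus a hypothetical hybrid that routes some query tokens through softmax and the rest through linear attention. The token split (`n_softmax`) and the elu+1 feature map are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax_attn(Q, K, V):
    # Standard softmax attention: O(n^2 * d) in sequence length n.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def linear_attn(Q, K, V):
    # Kernelized linear attention: phi(Q) (phi(K)^T V), O(n * d^2).
    # phi(x) = elu(x) + 1, a common strictly positive feature map.
    phi = lambda x: np.exp(np.minimum(x, 0.0)) + np.maximum(x, 0.0)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (d, d) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)       # per-query normalizer
    return (Qp @ KV) / Z[:, None]

def hybrid_attn(Q, K, V, n_softmax):
    # Hypothetical hybrid: the first n_softmax query tokens attend with
    # softmax over the full sequence; the rest use linear attention.
    out = np.empty_like(V)
    out[:n_softmax] = softmax_attn(Q[:n_softmax], K, V)
    out[n_softmax:] = linear_attn(Q[n_softmax:], K, V)
    return out
```

Setting `n_softmax = n` recovers full softmax attention, while `n_softmax = 0` is fully linear; intermediate values trade expressiveness for cost, which is the dial the block-rate strategy tunes per layer.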