Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
In video generation, the high computational cost of Transformer self-attention severely hinders long-sequence modeling. Existing sparsification methods—such as factorization or fixed-window attention—fail to effectively exploit inherent spatiotemporal redundancy in videos. To address this, we propose a hardware-aware structured sparse attention framework. First, we analyze attention distributions in video diffusion Transformers, revealing head-wise heterogeneous sparsity patterns. Second, we design an adaptive block partitioning strategy coupled with a time-varying sliding window mechanism to dynamically capture critical spatiotemporal dependencies. Third, we employ automated configuration search and hardware-friendly scheduling to optimize sparse computation. Our method achieves 1.6–2.5× attention speedup on a single GPU while matching the generation quality of full-attention baselines. It significantly improves efficiency for long-video synthesis without compromising fidelity.
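The hardware-friendly scheduling described above comes down to computing attention scores only for permitted query/key blocks. A minimal NumPy sketch of that idea (illustrative only; the block size and mask layout are assumptions, and the paper's actual GPU kernel skips masked blocks in hardware rather than materializing a full score matrix):

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Toy block-sparse attention: scores are computed only for query/key
    block pairs allowed by `block_mask` (an illustrative sketch, not the
    paper's kernel). Every query block must keep at least one key block,
    otherwise the row-wise softmax produces NaNs."""
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)  # masked entries stay at -inf
    nb = n // block
    for i in range(nb):
        for j in range(nb):
            if block_mask[i, j]:
                qs = slice(i * block, (i + 1) * block)
                ks = slice(j * block, (j + 1) * block)
                scores[qs, ks] = Q[qs] @ K[ks].T / np.sqrt(d)
    # row-wise softmax over the kept entries only
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V
```

With an all-True block mask this reduces exactly to dense softmax attention, which makes the sparse variant easy to sanity-check against a full-attention baseline.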

📝 Abstract
The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: attention matrices exhibit structured yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatio-temporal regions (e.g., local, cross-shaped, or global patterns). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) adaptive tiling strategies that approximate diverse spatial interaction patterns via dynamic tile grouping, 2) temporally varying windows that adjust sparsity levels based on frame proximity, and 3) an automated configuration search algorithm that optimizes sparse patterns while preserving critical attention pathways. Our method achieves 1.6–2.5× acceleration in attention computation on single-GPU setups while maintaining visual quality comparable to full-attention baselines. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation. Project Page: https://yo-ava.github.io/Compact-Attention.github.io/
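The "temporally varying windows" idea can be illustrated as a frame-level attention mask in which each query frame sees only nearby key frames. A toy sketch with a fixed window radius (hypothetical simplification; the paper's windows additionally vary with frame proximity and per head):

```python
import numpy as np

def temporal_window_mask(num_frames, tokens_per_frame, window):
    """Block-level attention mask: each query frame attends only to key
    frames within `window` frames of itself. Illustrative only -- `window`
    is a fixed, assumed radius, whereas the paper varies it over time."""
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for qf in range(num_frames):
        lo = max(0, qf - window)               # first visible key frame
        hi = min(num_frames, qf + window + 1)  # one past the last visible frame
        mask[qf * tokens_per_frame:(qf + 1) * tokens_per_frame,
             lo * tokens_per_frame:hi * tokens_per_frame] = True
    return mask
```

Because whole tile rows and columns are kept or dropped together, such a mask maps directly onto block-skipping kernels rather than per-element masking.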
Problem

Research questions and friction points this paper is trying to address.

Reducing computational demands in transformer-based video generation
Exploiting structured spatio-temporal sparsity for efficient attention
Overcoming limitations of rigid sparse patterns in video DiT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive tiling that dynamically groups tiles to approximate diverse spatial interaction patterns
Temporally varying windows that adjust sparsity based on frame proximity
Automated configuration search that optimizes sparse patterns while preserving critical attention pathways
Qirui Li
College of Computer Science & Technology, Zhejiang University
Guangcong Zheng
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.
Controllable Video/Image Synthesis · Diffusion Model · Personalization Generation · Multi-Modal · BEV
Qi Zhao
College of Computer Science & Technology, Zhejiang University
Jie Li
College of Computer Science & Technology, Zhejiang University
Bin Dong
Huawei Technologies
Yiwu Yao
Peking University
Artificial Intelligence
Xi Li
College of Computer Science & Technology, Zhejiang University