VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently processing long video sequences with Transformer-based models, whose self-attention mechanism incurs quadratic computational complexity. Existing sparse attention methods struggle to balance accuracy and efficiency. To overcome this limitation, the authors propose VecAttention, which identifies and exploits a prominent vertical-vector sparsity pattern in video attention maps for the first time. The method introduces a lightweight important-vector selection module and an optimized sparse attention kernel that dynamically processes critical vertical vectors. Evaluated across multiple video understanding and generation benchmarks, VecAttention achieves 2.65× faster inference than full attention and outperforms state-of-the-art sparse approaches by 1.83× in speed while maintaining comparable accuracy, thus establishing a superior trade-off between sparsity and performance.
๐Ÿ“ Abstract
Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at https://github.com/anminliu/VecAttention.
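The core idea described above can be illustrated with a toy sketch. A "vertical vector" in an attention map corresponds to one key position that many queries attend to, so the method keeps only the most important key columns and computes attention over those. The paper's actual selection module and kernel are not reproduced here; the pooled-query importance score below is an assumed stand-in for the lightweight selector, and `keep_ratio` is a hypothetical parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vertical_vector_sparse_attention(Q, K, V, keep_ratio=0.25):
    """Toy sketch of vertical-vector (key-column) sparse attention.

    NOTE: the importance score (mean query dotted with each key) is an
    assumption standing in for the paper's lightweight important-vector
    selection module, not the published method.
    """
    n_q, d = Q.shape
    n_k = K.shape[0]
    k_keep = max(1, int(n_k * keep_ratio))

    # Lightweight importance score: one pooled query vs. every key,
    # so scoring costs O(n_k * d) instead of O(n_q * n_k * d).
    q_pool = Q.mean(axis=0)                        # (d,)
    scores = K @ q_pool                            # (n_k,)
    keep = np.argsort(scores)[-k_keep:]            # kept key columns

    # Dense attention restricted to the selected key columns; a real
    # kernel would gather these columns inside a fused GPU kernel.
    attn = softmax((Q @ K[keep].T) / np.sqrt(d))   # (n_q, k_keep)
    return attn @ V[keep]
```

With `keep_ratio=1.0` this reduces to full attention, which makes the accuracy-sparsity trade-off easy to probe by sweeping the ratio.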
Problem

Research questions and friction points this paper is trying to address.

long-context video understanding
video generation
self-attention complexity
sparse attention
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

vector-wise sparse attention
vertical-vector sparsity
long-context video modeling
efficient Transformers
dynamic sparse attention
Authors
Anmin Liu: SCS, Peking University, Beijing, China; Key Lab of HCST (PKU), MOE, Beijing, China
Ruixuan Yang: Fudan University, Shanghai, China
Huiqiang Jiang: Microsoft Research Asia
Bin Lin: Alibaba Group, China
Minmin Sun: Alibaba Group, China
Yong Li: Alibaba Group, China
Chen Zhang: Shanghai Jiao Tong University
Tao Xie: Peking University Chair Professor; Fudan University Adjunct Top-Talent Professor