AI Summary
Existing sparse attention methods primarily target inference acceleration via predefined patterns, failing to bridge the training-inference gap and lacking fine-grained token selection across queries, key-value (KV) pairs, and attention heads. This work proposes OmniSparse, a training-aware multidimensional sparse attention framework, the first to jointly sparsify along three dimensions: (1) lazy/active query classification; (2) dynamic KV budget allocation guided by the flattest attention head; and (3) KV cache slimming. Leveraging semantic redundancy analysis and decoding-stage pattern recognition, it enables adaptive query selection, head-level dynamic KV pruning under a shared budget, and selective visual KV cache fetching. Experiments show that, while matching full-attention accuracy, the method achieves up to 2.7x speedup in the prefill phase and 2.4x memory reduction during decoding, significantly improving efficiency for long-video multimodal understanding.
Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
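To make the three mechanisms concrete, below is a minimal PyTorch sketch of the selection logic. It is an illustration under stated assumptions, not the paper's implementation: the entropy-based proxies for query laziness and head flatness, the quantile threshold, the `recall` and `keep_ratio` hyperparameters, and all function names are ours.

```python
import torch


def classify_queries(attn: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Lazy/active query classification (mechanism 1).

    attn: [num_queries, num_keys] row-normalized attention weights.
    Returns a boolean mask that is True for "active" queries. Here we
    assume high attention entropy (broad semantic coverage) marks a
    query as active, while low entropy (narrow local focus) marks it
    as lazy and droppable.
    """
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)
    threshold = torch.quantile(entropy, 1.0 - keep_frac)
    return entropy >= threshold


def shared_kv_budget(head_attn: torch.Tensor, recall: float = 0.95) -> int:
    """Head-level dynamic budget allocation (mechanism 2).

    head_attn: [num_heads, num_keys] per-head attention over keys
    (e.g., averaged over queries). The flattest head (highest entropy)
    needs the most keys to reach the target recall, so sizing one
    shared budget on it and applying it uniformly preserves recall
    for every head.
    """
    entropy = -(head_attn * head_attn.clamp_min(1e-9).log()).sum(dim=-1)
    flattest = entropy.argmax()
    mass = head_attn[flattest].sort(descending=True).values.cumsum(dim=-1)
    return int((mass < recall).sum().item()) + 1


def select_kv(head_attn: torch.Tensor, budget: int) -> torch.Tensor:
    """Per head, keep the top-`budget` KV pairs under the shared budget."""
    return head_attn.topk(budget, dim=-1).indices  # [num_heads, budget]


def slim_visual_cache(decode_attn: torch.Tensor,
                      visual_idx: torch.Tensor,
                      keep_ratio: float = 0.3) -> torch.Tensor:
    """KV cache slimming (mechanism 3).

    decode_attn: [num_heads, num_keys] attention pattern of decoding
    queries; visual_idx: positions of visual tokens in the KV cache.
    Each head fetches only the visual cache entries its own decoding
    pattern actually uses.
    """
    scores = decode_attn[:, visual_idx]              # [num_heads, num_visual]
    k = max(1, int(keep_ratio * visual_idx.numel()))
    top = scores.topk(k, dim=-1).indices
    return visual_idx[top]                           # [num_heads, k] cache positions
```

Sizing the shared budget on the flattest head is deliberately conservative: that head needs the most keys to cover its attention mass, so a uniform budget derived from it also preserves recall for the peakier heads, which is the rationale the abstract gives for applying one budget across all heads.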