AI Summary
Existing sparse attention methods primarily target inference acceleration via predefined patterns, failing to bridge the training-inference gap and lacking fine-grained token selection across queries, key-value (KV) pairs, and attention heads. This work proposes OmniSparse, a training-aware multidimensional sparse attention framework, the first to jointly sparsify along three dimensions: (1) lazy/active query classification; (2) dynamic KV budget allocation guided by the flattest attention head; and (3) KV cache slimming. Leveraging semantic redundancy analysis and decoding-stage pattern recognition, it enables adaptive query selection, head-level dynamic KV pruning under a shared budget, and selective visual KV cache fetching. Experiments show that, while matching full-attention accuracy, the method achieves up to 2.7x speedup in the prefill phase and 2.4x memory reduction during decoding, significantly improving efficiency for long-video multimodal understanding.
Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
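To make the three mechanisms concrete, below is a minimal PyTorch sketch of the selection logic. It is an illustration under stated assumptions, not the paper's implementation: the entropy-based proxies for query laziness and head flatness, the quantile threshold, the `recall` and `keep_ratio` hyperparameters, and all function names are ours.

```python
import torch


def classify_queries(attn: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Lazy/active query classification (mechanism 1).

    attn: [num_queries, num_keys] row-normalized attention weights.
    Returns a boolean mask that is True for "active" queries. Here we
    assume high attention entropy (broad semantic coverage) marks a
    query as active, while low entropy (narrow local focus) marks it
    as lazy and droppable.
    """
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)
    threshold = torch.quantile(entropy, 1.0 - keep_frac)
    return entropy >= threshold


def shared_kv_budget(head_attn: torch.Tensor, recall: float = 0.95) -> int:
    """Head-level dynamic budget allocation (mechanism 2).

    head_attn: [num_heads, num_keys] per-head attention over keys
    (e.g., averaged over queries). The flattest head (highest entropy)
    needs the most keys to reach the target recall, so sizing one
    shared budget on it and applying it uniformly preserves recall
    for every head.
    """
    entropy = -(head_attn * head_attn.clamp_min(1e-9).log()).sum(dim=-1)
    flattest = entropy.argmax()
    mass = head_attn[flattest].sort(descending=True).values.cumsum(dim=-1)
    return int((mass < recall).sum().item()) + 1


def select_kv(head_attn: torch.Tensor, budget: int) -> torch.Tensor:
    """Per head, keep the top-`budget` KV pairs under the shared budget."""
    return head_attn.topk(budget, dim=-1).indices  # [num_heads, budget]


def slim_visual_cache(decode_attn: torch.Tensor,
                      visual_idx: torch.Tensor,
                      keep_ratio: float = 0.3) -> torch.Tensor:
    """KV cache slimming (mechanism 3).

    decode_attn: [num_heads, num_keys] attention pattern of decoding
    queries; visual_idx: positions of visual tokens in the KV cache.
    Each head fetches only the visual cache entries its own decoding
    pattern actually uses.
    """
    scores = decode_attn[:, visual_idx]              # [num_heads, num_visual]
    k = max(1, int(keep_ratio * visual_idx.numel()))
    top = scores.topk(k, dim=-1).indices
    return visual_idx[top]                           # [num_heads, k] cache positions
```

Sizing the shared budget on the flattest head is deliberately conservative: that head needs the most keys to cover its attention mass, so a uniform budget derived from it also preserves recall for the peakier heads, which is the rationale the abstract gives for applying one budget across all heads.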