🤖 AI Summary
This work addresses the quadratic complexity of attention mechanisms, which severely limits large language models’ ability to handle long contexts. Existing dynamic sparse attention methods face inherent trade-offs among preprocessing requirements, global coverage, query independence, and computational overhead. To overcome these limitations, we propose RRAttention, a dynamic block-sparse attention mechanism that employs a per-head round-robin shifting strategy to achieve global coverage, query independence, and low computational cost without any preprocessing. RRAttention integrates cross-head rotated sampling, stride-level attention aggregation, and adaptive Top-τ sparse selection, reducing complexity from O(L²) to O(L²/S²). Experiments show that RRAttention recovers over 99% of full attention performance on HELMET and Video-MME benchmarks while computing only half the attention blocks, and achieves a 2.4× speedup at 128K context length, significantly outperforming existing approaches.
📝 Abstract
The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all of these desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from O(L²) to O(L²/S²) and employs adaptive Top-τ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4× speedup at 128K context length and outperforming existing dynamic sparse attention methods.
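To make the mechanism concrete, here is a minimal NumPy sketch of the two ideas the abstract names: per-head round-robin query sampling with stride-level aggregation, and adaptive Top-τ block selection. This is an illustrative reconstruction, not the paper's implementation — the block-summary pooling (mean over keys), the tensor shapes, and all function names are assumptions.

```python
import numpy as np

def rr_block_importance(Q, K, stride, block):
    """Estimate per-(stride, key-block) importance scores.

    Instead of scoring every query (O(L^2)), each head scores one query
    per stride of S positions; the sampled offset is rotated round-robin
    across heads (head h samples offset h mod S), so together the heads
    cover the stride. Scores are aggregated at stride level, giving
    O(L^2 / S^2) score evaluations per head.
    """
    H, L, d = Q.shape                      # heads, sequence length, head dim
    n_strides = L // stride
    n_blocks = L // block
    # Key-block summaries via mean pooling (an assumption; the paper's
    # exact block representation is not given in the abstract).
    Kb = K.reshape(H, n_blocks, block, d).mean(axis=2)   # (H, n_blocks, d)
    scores = np.zeros((n_strides, n_blocks))
    for h in range(H):
        offset = h % stride                              # round-robin shift
        q = Q[h, offset::stride][:n_strides]             # one query per stride
        scores += (q @ Kb[h].T) / np.sqrt(d)             # aggregate over heads
    return scores / H

def top_tau_blocks(scores, tau=0.9):
    """Adaptive Top-τ: per stride, keep the smallest set of key blocks
    whose softmax mass reaches τ (so the kept count adapts to the input
    rather than being a fixed top-k)."""
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    order = np.argsort(-p, axis=-1)                      # blocks by importance
    keep = []
    for row, idx in zip(p, order):
        cum = np.cumsum(row[idx])
        k = int(np.searchsorted(cum, tau)) + 1           # blocks needed for mass τ
        keep.append(sorted(idx[:k].tolist()))
    return keep

# Usage: estimate block importance cheaply, then full attention would be
# computed only on the selected (stride, block) pairs.
rng = np.random.default_rng(0)
H, L, d = 4, 64, 16
Q = rng.standard_normal((H, L, d))
K = rng.standard_normal((H, L, d))
scores = rr_block_importance(Q, K, stride=8, block=8)
selected = top_tau_blocks(scores, tau=0.9)
```

A full implementation would restrict selection to the causal region and run fused block-sparse attention kernels over `selected`; this sketch only shows how the cheap round-robin scoring pass picks the blocks.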