HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

šŸ“… 2026-03-30
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses the O(L²) computational bottleneck in fine-grained sparse attention for long contexts, which arises from global token indexing. To overcome this without retraining, the authors propose a two-stage hierarchical indexing approach: first, coarse-grained filtering based on block-level pooled representations prunes irrelevant regions; then, token-level refinement within the retained candidate blocks constructs an efficient sparse attention pattern. The method is compatible with the Sparse MLA operator and achieves 2Ɨ and 4Ɨ speedups at context lengths of 32K and 128K, respectively, while maintaining task performance nearly identical to the original DSA. Notably, the token selection exhibits over 99% intersection-over-union with the original sparse pattern, substantially reducing computational overhead without altering the sparsity structure.
šŸ“ Abstract
Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an $O(L^2)$ per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.
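The two-stage procedure described above (coarse block filtering via pooled representatives, then token-level refinement inside surviving blocks) can be sketched as follows. This is a minimal NumPy illustration of the general hierarchical top-k idea, not the paper's implementation: the function name, the use of mean pooling, dot-product scoring, and all parameter values are assumptions for the sketch.

```python
import numpy as np

def hierarchical_topk(q, keys, block_size=64, top_blocks=4, k=128):
    """Two-stage hierarchical index search (illustrative sketch).

    Stage 1: score mean-pooled block representatives against the query
    and keep the `top_blocks` highest-scoring blocks (coarse filter).
    Stage 2: score individual tokens only within those blocks and
    return the global indices of the top-k tokens (refinement).
    """
    L, d = keys.shape
    n_blocks = (L + block_size - 1) // block_size
    # Zero-pad so the last block pools cleanly (a simplification:
    # padding slightly dilutes the final block's mean).
    pad = n_blocks * block_size - L
    padded = np.vstack([keys, np.zeros((pad, d))]) if pad else keys
    block_reps = padded.reshape(n_blocks, block_size, d).mean(axis=1)

    # Stage 1: coarse block filter -- O(L / block_size) scores.
    block_scores = block_reps @ q
    keep = np.argsort(block_scores)[-top_blocks:]

    # Stage 2: token-level refinement restricted to candidate blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, min((b + 1) * block_size, L)) for b in keep]
    )
    token_scores = keys[cand] @ q
    k = min(k, cand.size)
    return np.sort(cand[np.argsort(token_scores)[-k:]])

# Usage: select 128 of 1000 key positions for one query vector.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 32))
q = rng.standard_normal(32)
idx = hierarchical_topk(q, keys, block_size=64, top_blocks=4, k=128)
```

The per-query cost drops from scoring all L tokens to scoring L/block_size block representatives plus top_blocks·block_size tokens, which is where the reported 2x/4x kernel speedups at 32K/128K come from.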
Problem

Research questions and friction points this paper is trying to address.

sparse attention
indexer bottleneck
context length
fine-grained selection
quadratic complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Indexing
Sparse Attention
Token-level Selection
Efficient Search
Long-context Modeling