MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the computational overhead that multi-head indexers add to sparse attention in long-context large language models. The authors propose MISA, which treats the indexer heads of DeepSeek Sparse Attention (DSA) as a mixture-of-experts pool. A lightweight router uses block-level statistics to activate only a few heads per query for fine-grained token scoring, and a hierarchical variant re-ranks an enlarged candidate set with the original indexer. The approach reduces the number of active indexer heads without any additional training. Experiments show that MISA matches the original DSA indexer on LongBench for both DeepSeek-V3.2 and GLM-5 while using only 1/8 and 1/4 of the indexer heads, respectively. It preserves fully green Needle-in-a-Haystack (NIAH) heatmaps up to a 128K-token context, recovers more than 92% of the tokens selected by the DSA indexer per layer, and its TileLang kernel runs roughly 3.82× faster than DSA's original indexer kernel on a single NVIDIA H200 GPU.
📝 Abstract
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
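The routing idea in the abstract can be illustrated with a small NumPy sketch for a single query token. Everything here is an assumption made for illustration, not the paper's implementation: the scoring form (a head-weighted sum of ReLU query-key dot products with keys shared across heads), the router rule (rank heads by their best weighted score on mean-pooled key blocks), and all names such as `route_heads`, `misa_select`, and `expand` are hypothetical. The sketch only shows the shape of the computation: a cheap pass over pooled keys picks a few heads, those heads score every token, and an optional second pass re-ranks an enlarged candidate set with all heads.

```python
import numpy as np


def indexer_scores(q, w, k):
    """Token scores summed over heads: sum_j w[j] * relu(q[j] . k[s]).

    q: (H, d) indexer queries for one query token, w: (H,) head weights,
    k: (T, d) indexer keys shared by all heads. Returns shape (T,).
    """
    return np.maximum(k @ q.T, 0.0) @ w


def pool_keys(k, block_size):
    """Mean-pool keys over consecutive blocks (the last block may be short)."""
    starts = np.arange(0, k.shape[0], block_size)
    sums = np.add.reduceat(k, starts, axis=0)
    counts = np.diff(np.append(starts, k.shape[0]))
    return sums / counts[:, None]


def route_heads(q, w, k, block_size, n_active):
    """Hypothetical router: rank heads by their best weighted score on pooled keys."""
    pooled = pool_keys(k, block_size)                      # (B, d) with B << T
    per_head = w[:, None] * np.maximum(q @ pooled.T, 0.0)  # (H, B)
    utility = per_head.max(axis=1)
    return np.argsort(-utility)[:n_active]


def top_indices(scores, n):
    """Indices of the n largest scores (order unspecified)."""
    n = min(n, scores.shape[0])
    return np.argpartition(-scores, n - 1)[:n]


def misa_select(q, w, k, top_k, block_size=64, n_active=8, expand=1):
    """Routed selection; expand > 1 enables the hierarchical re-ranking pass."""
    active = route_heads(q, w, k, block_size, n_active)
    routed = indexer_scores(q[active], w[active], k)  # only routed heads score tokens
    candidates = top_indices(routed, top_k * expand)
    if expand == 1:
        return candidates
    # Re-rank the enlarged candidate set with all heads, then keep top_k of it.
    full = indexer_scores(q, w, k[candidates])
    return candidates[top_indices(full, top_k)]


def recovery_rate(selected, reference):
    """Fraction of the reference token set that the selection recovered."""
    return len(np.intersect1d(selected, reference)) / len(reference)


# Demo on random data (illustrative sizes; 64 heads as quoted for DeepSeek-V3.2).
rng = np.random.default_rng(0)
H, d, T, top_k = 64, 32, 4096, 256
q, k = rng.standard_normal((H, d)), rng.standard_normal((T, d))
w = rng.random(H)
reference = top_indices(indexer_scores(q, w, k), top_k)
for expand in (1, 4):
    picked = misa_select(q, w, k, top_k, n_active=8, expand=expand)
    print(f"expand={expand}: recovery {recovery_rate(picked, reference):.2f}")
```

In this sketch the token-level pass costs on the order of `n_active * T` dot products instead of `H * T`, plus `H * T / block_size` for the router. The recovery numbers printed on random inputs carry no meaning for real models, since random queries and keys lack the structure a trained indexer exploits; they only show how the recovery metric would be computed.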
Problem

Research questions and friction points this paper is trying to address.

long-context
sparse attention
indexer
LLM inference
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

MISA
sparse attention
mixture-of-experts
long-context inference
efficient indexing