🤖 AI Summary
In long-context LLM inference, heavy KV cache loading overhead and the inflexibility of fixed-budget sparse attention hinder adaptation to dynamic token importance distributions. This paper proposes a dynamic sparse attention mechanism that adaptively selects salient tokens based on cumulative attention scores, eliminating the need for manual budget specification or calibration. It introduces a novel importance estimation method that jointly leverages hierarchical clustering-based token ordering and attention score distribution fitting, enabling calibration-free, low-overhead modeling of token importance. Furthermore, it integrates KV cache compression with dynamic re-indexing to optimize both storage efficiency and memory access patterns. Evaluated across multiple long-text benchmarks, the approach achieves a 7.29× speedup in attention decoding and a 1.58× end-to-end inference speedup with negligible accuracy degradation, significantly outperforming state-of-the-art sparse attention methods.
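To make the selection rule concrete, here is a minimal NumPy sketch of budget-free token selection by cumulative attention mass. It is an illustration of the idea, not the paper's implementation; the function name, the target fraction `p`, and the toy score vectors are assumptions.

```python
# Minimal sketch: keep the smallest token set whose attention mass reaches a
# target fraction p, instead of a fixed top-k budget. Names/values are illustrative.
import numpy as np

def select_by_cumulative_attention(attn_scores: np.ndarray, p: float = 0.9) -> np.ndarray:
    """attn_scores: softmax attention weights of one query over the KV cache (sums to 1)."""
    order = np.argsort(attn_scores)[::-1]        # tokens sorted by descending score
    cumulative = np.cumsum(attn_scores[order])   # running attention mass
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix reaching the target
    return order[:cutoff]

# A sparse head needs few tokens, a flat head needs many: the "budget" adapts.
sparse_head = np.array([0.60, 0.25, 0.10, 0.03, 0.02])
flat_head = np.full(5, 0.20)
print(select_by_cumulative_attention(sparse_head))  # -> [0 1 2]: 3 tokens cover 90% of attention
print(select_by_cumulative_attention(flat_head))    # -> all 5 tokens are needed
```

The same target fraction thus yields different per-head, per-layer token counts, which is the behavior a fixed budget cannot provide.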
📝 Abstract
Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in attention importance across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to a 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
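As a rough illustration of the clustering-plus-fitting estimate, the sketch below orders tokens by their cluster centroid's similarity to the query and fits a saturating curve to a short probe of the cumulative attention to predict where the target fraction is reached. This is not Tactic's actual code: the cluster count, probe size, exponential curve form, and all identifiers are assumptions made for the example.

```python
# Hedged sketch of importance estimation via clustering-based sorting plus
# distribution fitting. Cluster count, probe size, and the exponential curve
# are illustrative assumptions, not the paper's exact procedure.
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.optimize import curve_fit

def saturating(x, a, b):
    # Simple saturating form standing in for the fitted cumulative-attention profile.
    return a * (1.0 - np.exp(-b * x))

def estimate_cutoff(query, keys, p=0.9, n_clusters=8, n_probe=64):
    d = keys.shape[1]
    scores = np.exp(query @ keys.T / np.sqrt(d))
    scores /= scores.sum()                                   # softmax attention weights

    # (1) Clustering-based sorting: rank tokens by their cluster's similarity to the query.
    centroids, labels = kmeans2(keys, n_clusters, minit="++", seed=0)
    cluster_rank = np.argsort(centroids @ query)[::-1]
    token_order = np.concatenate([np.where(labels == c)[0] for c in cluster_rank])

    # (2) Distribution fitting: fit the cumulative attention of a short probe prefix,
    #     then solve the fitted curve for the token count reaching fraction p.
    probe = np.cumsum(scores[token_order[:n_probe]])
    xs = np.arange(1, len(probe) + 1, dtype=float)
    (a, b), _ = curve_fit(saturating, xs, probe, p0=(1.0, 0.05), maxfev=10000)
    remaining = max(1e-9, 1.0 - p / a)                       # guard against p >= a
    cutoff = -np.log(remaining) / max(b, 1e-9)               # solve a*(1-exp(-b*x)) = p
    cutoff = int(min(np.ceil(cutoff), len(token_order)))
    return token_order[:cutoff]

# Toy usage with random vectors; real use would take per-head queries and cached keys.
rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))
query = rng.normal(size=64)
print(len(estimate_cutoff(query, keys)))  # number of tokens kept for 90% attention mass
```

The point of the fitted curve is that only a short prefix of tokens needs exact scoring; the remaining cutoff is extrapolated, keeping the estimation overhead small relative to full attention.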