🤖 AI Summary
This work addresses the computational bottleneck in long-context large language model inference, where attention mechanisms suffer from rapidly increasing costs due to growing key-value caches. Existing sparse attention methods struggle to balance accuracy, selection overhead, and computational efficiency. To overcome this, we propose Double-P, a hierarchical sparse attention framework that introduces a two-level top-p mechanism for the first time: it first performs coarse-grained cluster-level estimation using size-weighted centroids, followed by adaptive fine-grained token-level refinement. Coupled with dynamic sparsity scheduling, this approach enables joint optimization across three stages. Evaluated on multiple long-context benchmarks, Double-P achieves near-lossless accuracy while reducing attention computation overhead by up to 1.8× and accelerating end-to-end decoding by up to 1.3×.
📝 Abstract
As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves a near-zero accuracy drop, reducing attention computation overhead by up to 1.8× and delivering up to 1.3× end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
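To make the two-level selection concrete, here is a minimal NumPy sketch of the general idea: a coarse top-p pass over clusters scored by size-weighted centroids, followed by a fine top-p pass over the tokens of the surviving clusters. The function name, the log-size weighting, and both threshold parameters are illustrative assumptions, not the paper's actual implementation or kernels.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def two_level_top_p(q, keys, cluster_ids, p_cluster=0.7, p_token=0.9):
    """Illustrative two-level top-p token selection (a sketch, not Double-P itself).

    Stage 1: score each cluster by its size-weighted centroid and keep the
    smallest set of clusters whose estimated attention mass reaches p_cluster.
    Stage 2: within surviving clusters, keep the tokens covering p_token of
    the renormalized token-level attention mass.
    """
    d = q.shape[-1]
    clusters = np.unique(cluster_ids)

    # Stage 1: coarse cluster-level estimation with size-weighted centroids.
    centroid_scores = []
    for c in clusters:
        members = keys[cluster_ids == c]
        centroid = members.mean(axis=0)
        # Adding log(size) approximates the total mass a cluster of
        # near-identical keys would contribute under softmax (an assumption).
        centroid_scores.append((q @ centroid) / np.sqrt(d) + np.log(len(members)))
    probs = softmax(np.array(centroid_scores))
    order = np.argsort(-probs)
    cut = np.searchsorted(np.cumsum(probs[order]), p_cluster) + 1
    kept_clusters = clusters[order[:cut]]

    # Stage 2: fine-grained token-level top-p inside the surviving clusters.
    token_idx = np.where(np.isin(cluster_ids, kept_clusters))[0]
    token_probs = softmax((keys[token_idx] @ q) / np.sqrt(d))
    t_order = np.argsort(-token_probs)
    t_cut = np.searchsorted(np.cumsum(token_probs[t_order]), p_token) + 1
    return token_idx[t_order[:t_cut]]
```

In this toy form, only the centroids of pruned clusters are ever scored against the query, which is where the selection-overhead savings come from; the paper's dynamic sparsity scheduling and fused kernels are not modeled here.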