Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs

📅 2026-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the computational bottleneck in long-context large language model inference, where attention mechanisms suffer from rapidly increasing costs due to growing key-value caches. Existing sparse attention methods struggle to balance accuracy, selection overhead, and computational efficiency. To overcome this, the authors propose Double-P, a hierarchical sparse attention framework that is the first to introduce a two-level top-p mechanism: it performs coarse-grained cluster-level estimation using size-weighted centroids, followed by adaptive fine-grained token-level refinement. Coupled with dynamic sparsity scheduling, this approach enables joint optimization across all three stages (top-p accuracy, selection overhead, and sparse attention cost). Evaluated on multiple long-context benchmarks, Double-P achieves near-lossless accuracy while reducing attention computation overhead by up to 1.8× and accelerating end-to-end decoding by up to 1.3×.

📝 Abstract
As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves a near-zero accuracy drop while reducing attention computation overhead by up to 1.8x and delivering up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
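The two-level selection described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the abstract's description, not the paper's actual kernel: the cluster assignments, the size-weighted centroid scoring (here, centroid logit plus log cluster size), and the two threshold parameters `p_cluster` and `p_token` are all assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_p_mask(probs, p):
    """Smallest index set whose probability mass reaches p."""
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, p)) + 1
    return order[:k]

def hierarchical_top_p(q, keys, cluster_ids, p_cluster=0.9, p_token=0.95):
    """Illustrative two-level top-p selection (not the paper's implementation).

    Stage 1 (coarse): score size-weighted cluster centroids against the
    query and keep the smallest cluster set covering p_cluster of the
    estimated attention mass.
    Stage 2 (fine): compute exact scores only for tokens in surviving
    clusters and keep the tokens covering p_token of that mass.
    """
    clusters = np.unique(cluster_ids)
    sizes = np.array([(cluster_ids == c).sum() for c in clusters])
    centroids = np.stack([keys[cluster_ids == c].mean(axis=0) for c in clusters])
    # Coarse estimate: centroid logit weighted by cluster size in log-space.
    est = softmax(centroids @ q + np.log(sizes))
    keep = clusters[top_p_mask(est, p_cluster)]
    # Fine stage: exact attention scores for candidate tokens only.
    cand = np.where(np.isin(cluster_ids, keep))[0]
    probs = softmax(keys[cand] @ q)
    return cand[top_p_mask(probs, p_token)]
```

Because stage 2 only touches tokens inside the clusters retained by stage 1, the exact-score computation scales with the selected subset rather than the full cache, which is the source of the efficiency gain the abstract claims.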
Problem

Research questions and friction points this paper is trying to address.

long-context LLMs
sparse attention
top-p
attention bottleneck
decoding efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical sparse attention
top-p sparsity
long-context LLMs
adaptive token selection
efficient decoding
Wentao Ni
University of California San Diego
Computer Architecture, Machine Learning Systems
Kangqi Zhang
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, United States
Zhongming Yu
University of California, San Diego
Computer Systems, Machine Learning
Oren Nelson
Department of Computer Science and Engineering, University of California San Diego, La Jolla, United States
Mingu Lee
Qualcomm AI Research
AI, ML, LLM, Signal Processing
Hong Cai
Qualcomm AI Research, San Diego, United States
F. Porikli
Qualcomm AI Research, San Diego, United States
Jongryool Kim
SK hynix America, San Jose, United States
Zhijian Liu
Research Scientist at NVIDIA, Assistant Professor at UC San Diego
Machine Learning, Efficient Deep Learning
Jishen Zhao
Professor at University of California, San Diego
Computer Architecture, Computer Systems, Machine Learning Systems, Electronic Design Automation