Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the latency of exact per-step Top-K selection in sparse-attention decoding for long-context large language model inference. It proposes the Guess-Verify-Refine (GVR) algorithm, which exploits temporal correlation by using the previous decode step's Top-K result as a prior. GVR combines pre-indexed statistics, secant-style threshold estimation, ballot-free candidate verification, and shared-memory refinement, and it preserves bit-exact Top-K outputs. Implemented on NVIDIA's Blackwell architecture and integrated into TensorRT-LLM, GVR delivers an average 1.88× operator-level speedup over the production radix-select kernel, with up to 2.42× per layer per step. In end-to-end evaluation at 100K-token context, it improves TPOT by up to 7.52%, with larger gains at longer contexts and smaller but still positive gains under speculative decoding.

📝 Abstract

Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step's Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1-2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz / RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average 1.88x single-operator speedup over the production radix-select kernel, with up to 2.42x per layer per step, while preserving bit-exact Top-K outputs. In controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to 7.52% at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.
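
The guess, verify, and refine stages described in the abstract can be illustrated with a small CPU sketch. This is not the paper's Blackwell kernel: the function names (`exact_topk`, `gvr_topk`), the `budget` and `max_passes` parameters, the interpolation rule, and the lower-index tie-break are all assumptions made for illustration. The sketch relies on one observation: if the previous step's Top-K holds K distinct indices, the smallest current score among them is a threshold that at least K entries meet, so the candidate set it defines always contains the exact Top-K.

```python
def exact_topk(scores, k):
    """Reference: indices of the k largest scores, ties broken by lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def count_at_least(scores, t):
    """One full counting pass: how many scores are >= t."""
    return sum(1 for s in scores if s >= t)


def gvr_topk(scores, k, prev_topk, budget=None, max_passes=2):
    """Exact Top-K that starts from the previous step's Top-K (hypothetical sketch)."""
    assert 0 < k <= len(scores)
    budget = max(k, budget if budget is not None else 2 * k)

    # Guess: the weakest current score among the previous Top-K indices.
    # With k distinct indices, at least k scores are >= lo.
    prev = set(prev_topk)
    lo = min(scores[i] for i in prev) if len(prev) >= k else min(scores)
    count_lo = count_at_least(scores, lo)

    # Verify and narrow: raise lo by interpolating on counts (secant-style),
    # accepting a new threshold only if it still keeps at least k entries.
    hi = max(scores)
    count_hi = count_at_least(scores, hi)
    for _ in range(max_passes):
        if count_lo <= budget or hi <= lo or count_lo == count_hi:
            break
        frac = (count_lo - k) / (count_lo - count_hi)
        t = min(hi, lo + frac * (hi - lo))
        c = count_at_least(scores, t)
        if c >= k:
            lo, count_lo = t, c      # still valid: candidate set shrinks
        else:
            hi, count_hi = t, c      # too aggressive: tighten the upper bound

    # Refine: exact selection restricted to the small candidate set.
    candidates = [(s, i) for i, s in enumerate(scores) if s >= lo]
    candidates.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in candidates[:k]]
```

Because the threshold is only ever raised when the count stays at or above K, the result matches `exact_topk` regardless of how good the guess is; a poor guess only makes the candidate set larger and the refine step slower.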

Problem

Research questions and friction points this paper is trying to address.

sparse-attention decoding

Top-K selection

temporal correlation

latency bottleneck

long-context LLM serving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Guess-Verify-Refine

Temporal Correlation

Sparse-Attention Decoding