SpecSA: Bridging Speculative Decoding and Sparse Attention for Efficient LLM Inference

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a structural mismatch between speculative decoding and dynamic sparse attention, which limits KV cache reuse, incurs high branching overhead, and makes verification strategies sensitive to input and execution patterns. To resolve this, the paper introduces SpecSA, the first framework that unifies these two techniques through a verification-aware sparse inference mechanism. SpecSA enables efficient KV cache reuse and co-optimizes decoding strategies via overlap-aware grouped query execution, a fused refresh-and-reuse NSA kernel, and profiler-driven prompt-adaptive scheduling. Evaluated on NVIDIA H100 GPUs, SpecSA achieves up to a 3.49× end-to-end throughput improvement and a 6.86× speedup in sparse speculative verification kernels.
📝 Abstract
Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SpecSA, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SpecSA combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SpecSA achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.
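The reuse problem described in the abstract can be made concrete with a small sketch. It is not SpecSA's algorithm or kernel; it only illustrates, under stated assumptions, why query-specific sparse layouts matter for verification. The assumption is that each verifier query selects a set of KV-block indices, and that a grouped execution could load each distinct block once for the whole group. The function name `block_overlap_stats` and the example block sets are hypothetical.

```python
from typing import Dict, List, Set


def block_overlap_stats(selected: List[Set[int]]) -> Dict[str, float]:
    """Compare KV-block loads for independent vs. grouped verification.

    selected[i] is the (hypothetical) set of KV-block indices that the
    sparse-attention selector picked for verifier query i.
    """
    # Loads if every query fetches its own selected blocks independently.
    independent_loads = sum(len(s) for s in selected)
    # Loads if each distinct block is fetched once for the whole group.
    grouped_loads = len(set().union(*selected))
    # Blocks that every query in the group selected.
    shared_by_all = len(set.intersection(*selected)) if selected else 0
    return {
        "independent_loads": independent_loads,
        "grouped_loads": grouped_loads,
        "shared_by_all": shared_by_all,
        # How many times each loaded block is reused on average.
        "reuse_factor": independent_loads / grouped_loads if grouped_loads else 1.0,
    }


# Three verifier queries whose layouts overlap only partially.
queries = [{0, 1, 2, 7}, {0, 1, 3, 7}, {0, 2, 3, 7}]
stats = block_overlap_stats(queries)
print(stats)
```

With these illustrative sets, independent execution issues 12 block loads while grouped execution needs 5, because only blocks 0 and 7 are common to all three queries. The less the per-query layouts overlap, the closer the reuse factor falls to 1, which is the regime where combining the two techniques naively gains little.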
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
sparse attention
KV-cache reuse
LLM inference
structural mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Sparse Attention
KV-cache Reuse
Kernel Fusion
Adaptive Orchestration