Stochastic Sparse Attention for Memory-Bound Inference

📅 2026-05-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the memory-bandwidth bottleneck of reading the key-value (KV) cache during autoregressive decoding at long context lengths. The authors propose SANTA, a multiplier-free stochastic sparse attention mechanism that samples a small number of indices from the post-softmax distribution and aggregates only the corresponding value vectors. The result is an unbiased estimate of the attention output, and the value stage reduces to memory gathers and additions. To reduce variance, SANTA uses stratified sampling in the value aggregation; a complementary technique, Bernoulli qKᵀ sampling, sparsifies the score computation. The method maps efficiently onto GPUs and is orthogonal to compression techniques such as quantization and low-rank approximation. Experiments report that, at a 32k-token context length, SANTA achieves a 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada GPU while matching baseline accuracy.
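The sampling idea described in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' kernel: the function names, the `1/sqrt(d)` score scaling, and the random test data are assumptions made for illustration, and the score stage here still uses ordinary multiplications. The sketch draws `S` indices from the post-softmax distribution and averages the gathered value rows, then checks empirically that the average of many such estimates approaches the exact attention output.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def exact_attention(q, K, V):
    # Dense single-query attention: reads every row of V.
    p = softmax(K @ q / np.sqrt(q.shape[0]))
    return p @ V

def sampled_attention(q, K, V, S, rng):
    # Hypothetical helper: sample S indices from the post-softmax
    # distribution, then gather and add only those value rows.
    p = softmax(K @ q / np.sqrt(q.shape[0]))
    idx = rng.choice(len(p), size=S, replace=True, p=p)
    return V[idx].sum(axis=0) / S  # value stage is gather-and-add plus one scale

rng = np.random.default_rng(0)
n_k, d, S = 512, 16, 32          # illustrative sizes, S << n_k
q = rng.standard_normal(d)
K = rng.standard_normal((n_k, d))
V = rng.standard_normal((n_k, d))

exact = exact_attention(q, K, V)
# Averaging many independent estimates should converge to the exact output.
avg = np.mean([sampled_attention(q, K, V, S, rng) for _ in range(2000)], axis=0)
print("max abs deviation of averaged estimate:", np.abs(avg - exact).max())
```

Each call touches only `S` of the `n_k` value rows, which is where the bandwidth saving in the value stage would come from; a single estimate is noisy, and the variance shrinks as `S` grows.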
📝 Abstract
Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
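As a rough illustration of why stratified sampling can reduce variance, the sketch below compares i.i.d. index sampling with a textbook stratified scheme that draws one uniform variate per equal-width stratum of the CDF. The paper's GPU-friendly variants may differ; the function names, the synthetic logits, and the problem sizes are all assumptions made for this example.

```python
import numpy as np

def iid_indices(p, S, rng):
    # Baseline: S independent draws from the post-softmax distribution p.
    return rng.choice(len(p), size=S, replace=True, p=p)

def stratified_indices(p, S, rng):
    # Assumed scheme: split [0, 1) into S equal strata, draw one uniform
    # per stratum, and invert the CDF. Each index i is still selected with
    # marginal probability p[i], so the estimator stays unbiased.
    cdf = np.cumsum(p)
    u = (np.arange(S) + rng.random(S)) / S
    idx = np.searchsorted(cdf, u, side="right")
    return np.minimum(idx, len(p) - 1)  # guard against floating-point round-off

def mse(sampler, p, V, S, rng, trials=2000):
    # Mean squared error of the gather-and-average estimate vs. exact p @ V.
    exact = p @ V
    errs = [np.sum((V[sampler(p, S, rng)].mean(axis=0) - exact) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

rng = np.random.default_rng(0)
n_k, d, S = 512, 16, 32                     # illustrative sizes
logits = 1.5 * rng.standard_normal(n_k)     # synthetic attention scores
p = np.exp(logits - logits.max())
p /= p.sum()
V = rng.standard_normal((n_k, d))

mse_iid = mse(iid_indices, p, V, S, rng)
mse_strat = mse(stratified_indices, p, V, S, rng)
print("i.i.d. MSE:", mse_iid, " stratified MSE:", mse_strat)
```

Both samplers read the same number of value rows per step; the stratified one spreads its draws across the CDF, so rows with large attention weight are covered in every estimate rather than only on average.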
Problem

Research questions and friction points this paper is trying to address.

memory-bound inference
autoregressive decoding
KV cache
long-context attention
bandwidth limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Sparse Attention
Memory-Bound Inference
Multiplier-Free Attention
KV-Cache Sparsification
Stratified Sampling