Stochastic Sparse Attention for Memory-Bound Inference

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory-bandwidth bottleneck of reading the key-value (KV) cache during autoregressive decoding at long context lengths. The authors propose SANTA (Stochastic Additive No-mulT Attention), a stochastic sparse attention mechanism that samples a small number of indices from the post-softmax distribution and aggregates only the corresponding value vectors. This yields an unbiased estimate of the attention output and reduces the value stage to memory gathers and additions instead of multiply-accumulates. Stratified sampling lowers the variance of the value aggregation, and Bernoulli qKᵀ sampling separately sparsifies the score computation. The method maps efficiently to GPUs and is orthogonal to techniques such as ternary quantization, low-rank projections, and KV-cache compression. At a 32k-token context, SANTA achieves a 1.5× speedup of the decode-step attention kernel over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada GPU while matching baseline accuracy.
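The unbiasedness claim can be checked with a short derivation. The notation below is an assumption introduced for illustration: $p_j$ is the post-softmax weight of key $j$, $v_j$ is the corresponding value row, and the $S$ indices $i_1, \dots, i_S$ are drawn independently from $p$. The summary describes stratified sampling as the variance-reduced variant, so the plain i.i.d. case shown here is only the simplest instance.

$$
o = \sum_{j=1}^{n_k} p_j v_j, \qquad
\hat{o} = \frac{1}{S} \sum_{s=1}^{S} v_{i_s}, \qquad i_s \sim p
$$

$$
\mathbb{E}[\hat{o}] = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}[v_{i_s}]
= \frac{1}{S} \sum_{s=1}^{S} \sum_{j=1}^{n_k} p_j v_j = o, \qquad
\operatorname{Cov}[\hat{o}] = \frac{1}{S} \Big( \sum_{j=1}^{n_k} p_j v_j v_j^\mathsf{T} - o\, o^\mathsf{T} \Big)
$$

The estimator involves only a gather of $S$ value rows, additions, and one final scaling by $1/S$. Under i.i.d. sampling its covariance shrinks as $1/S$.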

📝 Abstract

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating a $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
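The value-stage idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' kernel. The function names are hypothetical, the $1/\sqrt{d}$ score scaling is an assumption, and the sampler draws indices i.i.d. with replacement rather than using the stratified variants the abstract mentions. The scores are also computed densely here, so only the value-cache access is sparsified.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def exact_attention(q, K, V):
    """Dense single-query attention: reads all n_k value rows."""
    p = softmax(K @ q / np.sqrt(q.shape[0]))  # assumed 1/sqrt(d) scaling
    return p @ V


def sampled_attention(q, K, V, num_samples, rng):
    """Illustrative estimator: gather S value rows and add them.

    Indices are drawn i.i.d. from the post-softmax distribution
    (an assumption; the paper describes stratified variants).
    """
    p = softmax(K @ q / np.sqrt(q.shape[0]))
    idx = rng.choice(len(p), size=num_samples, replace=True, p=p)
    # Value stage: a gather plus additions, then one scaling by 1/S.
    return V[idx].sum(axis=0) / num_samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_k, d, S = 1024, 64, 32  # toy sizes chosen for illustration
    q = rng.standard_normal(d)
    K = rng.standard_normal((n_k, d))
    V = rng.standard_normal((n_k, d))

    exact = exact_attention(q, K, V)
    estimate = sampled_attention(q, K, V, S, rng)
    print("value rows read:", S, "of", n_k)
    print("max abs error of one estimate:", np.max(np.abs(estimate - exact)))
```

Averaging many independent estimates converges to the dense output, which is what unbiasedness means in practice. The error of a single estimate depends on how concentrated the softmax distribution is and on the sample count $S$.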

Problem

Research questions and friction points this paper is trying to address.

memory-bound inference

autoregressive decoding

KV cache

long-context attention

bandwidth limitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Sparse Attention

Memory-Bound Inference

Multiplier-Free Attention