🤖 AI Summary
Existing sparse attention methods, such as approximate top-k selection and random sampling, fail to give consistent approximations across heads and queries and offer no guarantees on approximation quality, which hinders practical deployment. This paper introduces vAttention, the first practical sparse attention mechanism supporting user-specified (ε, δ) statistical accuracy guarantees. It combines top-k selection with random sampling and uses the statistical guarantees of sampling to bound the approximation error, giving consistent approximations of full attention across heads and queries. Empirically, it improves sparse attention quality by roughly 4.5 percentage points on RULER-HARD and matches full-model quality at up to 20× sparsity across datasets; on AIME2024, it reaches full-model quality at 10× sparsity with generations of up to 32K tokens, enabling fast decoding in reasoning scenarios.
📝 Abstract
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (hence "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD) and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
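
To make the complementarity of top-$k$ and random sampling concrete, the following is a minimal NumPy sketch for a single query vector. It is not the vAttention implementation: the function names (`full_attention`, `hybrid_sparse_attention`), the use of exact scores to pick the top-$k$ tokens, and the uniform sampling without replacement are all assumptions made for illustration, and the sketch does not choose its budget from an $(ε, δ)$ target. The heavy tokens are handled exactly, and the remaining tokens are estimated from a uniform sample rescaled by the inverse sampling fraction, so the numerator and denominator estimates of the residual sums are each unbiased.

```python
import numpy as np


def full_attention(q, K, V):
    """Exact softmax attention output for one query vector q."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())  # shift for numerical stability
    return (w @ V) / w.sum()


def hybrid_sparse_attention(q, K, V, top_k, n_samples, rng):
    """Illustrative top-k + uniform-sampling estimate (hypothetical helper).

    Assumption: scores are computed densely here only to pick the top-k set;
    a real sparse method would use an approximate top-k index instead.
    Requires 1 <= top_k <= number of keys.
    """
    n, d = K.shape
    scores = K @ q / np.sqrt(d)

    # Deterministic part: the top_k highest-scoring tokens, used exactly.
    top_idx = np.argpartition(scores, -top_k)[-top_k:]
    rest_idx = np.setdiff1d(np.arange(n), top_idx)

    shift = scores[top_idx].max()  # global max, since top_idx is exact here
    w_top = np.exp(scores[top_idx] - shift)
    num = w_top @ V[top_idx]
    den = w_top.sum()

    # Stochastic part: uniform sample of the remaining tokens, rescaled by
    # (population size / sample size) to estimate the residual sums.
    m = min(n_samples, rest_idx.size)
    if m > 0:
        sample_idx = rng.choice(rest_idx, size=m, replace=False)
        w_smp = np.exp(scores[sample_idx] - shift)
        scale = rest_idx.size / m
        num = num + scale * (w_smp @ V[sample_idx])
        den = den + scale * w_smp.sum()

    return num / den


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4096, 64
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))

    exact = full_attention(q, K, V)
    approx = hybrid_sparse_attention(q, K, V, top_k=128, n_samples=256, rng=rng)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative L2 error: {rel_err:.4f}")
```

When `top_k + n_samples` covers every key, the rescaling factor is 1 and the estimate coincides with full attention; with smaller budgets the error depends on how concentrated the attention scores are, which is the trade-off the abstract describes.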