SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

📅 2024-06-17

🏛️ arXiv.org

📈 Citations: 22

✨ Influential: 0

🤖 AI Summary

Long-context LLMs suffer from high time-to-first-token (TTFT) latency because standard attention has quadratic complexity, while existing sparse attention methods degrade accuracy or require retraining. This paper proposes SampleAttention, a fine-tuning-free adaptive structured sparse attention mechanism that replaces the vanilla attention modules in off-the-shelf LLMs without architectural modification. Key contributions include: (1) providing theoretical and empirical foundations for near-lossless sparse attention; (2) capturing head-specific sparsity patterns dynamically at runtime, combining local windows with column stripes; and (3) attending to a fixed percentage of adjacent tokens for the local window pattern and using two-stage query-guided key-value filtering to select the column stripes. Evaluated across multiple benchmarks, the approach achieves near-lossless accuracy (<0.3% performance drop) while reducing TTFT by up to 2.42× compared to FlashAttention.

📝 Abstract

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
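To make the two structured patterns in the abstract concrete, here is a minimal NumPy sketch of a causal mask that combines a local window (a fixed percentage of adjacent tokens) with a set of column stripes. It is an illustration under stated assumptions, not the authors' kernel: the function names `structured_sparse_mask` and `masked_attention`, the parameter `window_ratio`, and the dense masked softmax are all hypothetical choices made for readability, and the stripe columns are simply passed in rather than selected.

```python
import numpy as np

def structured_sparse_mask(n, window_ratio, stripe_cols):
    """Boolean (n, n) mask: causal local window plus selected key columns.

    All names and parameters here are illustrative assumptions.
    """
    window = max(1, int(np.ceil(window_ratio * n)))  # fixed percentage of adjacent tokens
    q = np.arange(n)[:, None]   # query positions (rows)
    k = np.arange(n)[None, :]   # key positions (columns)
    causal = k <= q             # a query never sees future keys
    local = (q - k) < window    # keys inside the trailing local window
    stripes = np.zeros((1, n), dtype=bool)
    stripes[0, np.asarray(stripe_cols, dtype=int)] = True  # column stripes
    return causal & (local | stripes)

def masked_attention(Q, K, V, mask):
    """Dense reference attention that ignores masked-out key positions."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = structured_sparse_mask(n, window_ratio=0.25, stripe_cols=[0])
out = masked_attention(Q, K, V, mask)
print(mask[5].nonzero()[0])  # keys visible to query 5 → [0 4 5]
```

A real implementation would skip the masked entries entirely instead of computing a dense score matrix; the dense form is used here only so the pattern is easy to inspect.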

Problem

Research questions and friction points this paper is trying to address.

Reduces long Time-to-First-Token latency in long-context LLMs

Achieves near-lossless acceleration without accuracy sacrifice

Replaces vanilla attention without requiring retraining or finetuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive structured sparse attention for efficiency

Two-stage query-guided key-value filtering approach

Near-lossless acceleration with low overhead
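The two-stage query-guided key-value filtering listed above can be sketched roughly as follows. This is a simplified reading of the idea, not the paper's algorithm: stage 1 scores key columns using only a sampled subset of queries, and stage 2 keeps the smallest set of columns whose accumulated attention mass reaches a target. The function name `select_stripe_columns` and the parameters `sample_ratio`, `coverage`, and `seed` are assumptions introduced for illustration.

```python
import numpy as np

def select_stripe_columns(Q, K, sample_ratio=0.05, coverage=0.95, seed=0):
    """Pick key columns to keep, guided by a sample of queries (illustrative)."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)

    # Stage 1: compute attention for a sampled subset of query rows only.
    m = max(1, int(sample_ratio * n))
    rows = np.sort(rng.choice(n, size=m, replace=False))
    scores = Q[rows] @ K.T / np.sqrt(d)
    causal = np.arange(n)[None, :] <= rows[:, None]
    scores = np.where(causal, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    col_mass = probs.sum(axis=0)  # attention mass each key column received

    # Stage 2: keep the fewest columns whose mass reaches the coverage target.
    order = np.argsort(-col_mass)
    cum = np.cumsum(col_mass[order]) / col_mass.sum()
    k = min(n, int(np.searchsorted(cum, coverage)) + 1)
    return np.sort(order[:k])

# Toy input in which key 2 dominates every query that can see it.
rng = np.random.default_rng(1)
n, d = 64, 8
Q = 1.0 + 0.01 * rng.standard_normal((n, d))
K = 0.01 * rng.standard_normal((n, d))
K[2] = 10.0
cols = select_stripe_columns(Q, K, sample_ratio=0.25, coverage=0.5)
print(cols)  # → [2]
```

Sampling queries keeps the selection cost well below that of full attention, which is the "low overhead" property the list above refers to; the exact sampling scheme and threshold used by SampleAttention are not specified in this summary.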

🔎 Similar Papers

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval