Inference-time sparse attention with asymmetric indexing

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive computational and memory overhead of Transformer self-attention during inference on long contexts (100k–500k tokens), this paper proposes SAAP, a finetuning-free, plug-and-play sparse attention mechanism. The core challenge is that keys and queries follow different distributions, a mismatch exacerbated by RoPE positional encoding, which renders conventional symmetric vector indexing ineffective. SAAP therefore introduces an *asymmetric indexing paradigm* that builds distinct, data-adaptive partitions for keys and queries, jointly accounting for RoPE-induced geometry and the intrinsic key-query distribution asymmetry. It further integrates GPU-optimized vector search, a lightweight query classifier trained offline, and dynamic sparse-pattern generation. Evaluated on Llama 3.1-8B, SAAP reduces the fraction of memory looked up by roughly 20× and cuts inference latency by 60% compared to FlashAttention-v2, with no architectural or training modifications.

📝 Abstract
Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compliant vector search algorithms, yet standard partitioning methods yield poor results in this context because of (1) the distribution mismatch between keys and queries and (2) the effect of RoPE positional encoding. In this paper, we introduce SAAP (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetric indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models without finetuning, as it only requires training (offline) a small query classifier. On a long-context Llama 3.1-8B model, with sequences ranging from 100k to 500k tokens, our method typically reduces the fraction of memory that needs to be looked up by a factor of 20, which translates into a 60% time saving compared to FlashAttention-v2.
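The core idea of asymmetric indexing can be illustrated with a minimal NumPy sketch: keys are partitioned by clustering in key space, while queries are routed to partitions by a separate rule standing in for the paper's trained query classifier. All names and the plain k-means routine below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_parts = 64, 4096, 32

keys = rng.standard_normal((n_keys, d)).astype(np.float32)

def kmeans(x, k, iters=10):
    # Plain k-means: an illustrative stand-in for the paper's
    # data-adaptive heterogeneous partitioning of the keys.
    centroids = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = x[assign == j].mean(0)
    return centroids, assign

centroids, key_part = kmeans(keys, n_parts)

def route_query(q, n_probe=4):
    # Asymmetry: queries are routed by their own rule rather than the
    # key-side assignment; a trained classifier plays this role in SAAP.
    scores = centroids @ q
    return np.argsort(-scores)[:n_probe]

q = rng.standard_normal(d).astype(np.float32)
probed = route_query(q)
candidates = np.flatnonzero(np.isin(key_part, probed))
print(f"looked up {candidates.size}/{n_keys} keys")
```

Only the keys in the probed partitions are fetched, which is where the reduction in memory accesses comes from; the fraction looked up is controlled by `n_probe`.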
Problem

Research questions and friction points this paper is trying to address.

Improves self-attention speed
Addresses key-query distribution mismatch
Reduces memory lookup in transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric indexing for self-attention
Data-adaptive sparsity pattern approximation
Offline query classifier training
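The data-adaptive sparsity pattern above amounts to running softmax attention over only a retrieved subset of keys. A minimal NumPy sketch under assumed names (not the paper's code), using a top-score subset as a stand-in for the index lookup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_keys = 64, 4096
keys = rng.standard_normal((n_keys, d)).astype(np.float32)
values = rng.standard_normal((n_keys, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

def sparse_attention(q, keys, values, candidate_idx):
    # Softmax attention restricted to the candidate keys; the skipped
    # keys are treated as if they contributed negligible weight.
    k_sub, v_sub = keys[candidate_idx], values[candidate_idx]
    logits = k_sub @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sub

def full_attention(q, keys, values):
    logits = keys @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values

# Stand-in for the index lookup: keep the 256 highest-scoring keys.
top = np.argsort(-(keys @ q))[:256]
approx = sparse_attention(q, keys, values, top)
exact = full_attention(q, keys, values)
print("approximation error:", np.linalg.norm(approx - exact))
```

Because softmax weight concentrates on high-scoring keys, attending to a well-chosen subset can approximate the full result while touching a small fraction of the KV cache.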