🤖 AI Summary
Softmax-based attention suffers from quadratic time complexity, severely limiting modeling of ultra-long contexts; even optimized implementations like FlashAttention struggle beyond 4M tokens. This paper introduces RACE Attention, an attention mechanism that replaces Softmax with a sharpened angular similarity and combines random projections with soft locality-sensitive hashing (LSH) to achieve linear-time approximate attention. Its linear complexity, in principle, allows scaling to billion-token contexts, and in practice RACE Attention completes a single forward-backward pass over sequences of up to 12M tokens on GPU and 75M tokens on CPU. It significantly reduces memory consumption and latency while maintaining language modeling accuracy competitive with state-of-the-art baselines.
📄 Abstract
Softmax Attention has quadratic time complexity, which becomes prohibitive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.
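To make the idea concrete, here is a minimal, hedged NumPy sketch of the core trick the abstract describes: queries and keys are mapped through shared random projections, softly assigned to hash buckets (a differentiable stand-in for hard LSH bucketing), and values are pooled per bucket, so cost grows linearly in sequence length rather than quadratically. The function names, bucket counts, temperature, and the exact normalization below are illustrative assumptions, not the paper's actual kernel or the sharpened angular similarity it derives.

```python
import numpy as np

def _softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_lsh_attention(Q, K, V, n_hashes=4, n_buckets=16, temp=1.0, seed=0):
    """Illustrative linear-time attention via random projections + soft LSH.

    Single head, non-causal, for intuition only -- NOT the paper's RACE
    Attention implementation. Cost is O(N * n_buckets * d) per hash table,
    versus O(N^2 * d) for exact Softmax Attention.
    """
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    out = np.zeros_like(V)
    norm = np.zeros((N, 1))
    for _ in range(n_hashes):
        # Random projection directions define this table's hash buckets.
        R = rng.standard_normal((d, n_buckets))
        # Soft bucket assignments replace hard LSH so the estimator
        # stays differentiable; rows of pq/pk sum to 1.
        pq = _softmax(Q @ R / temp)          # (N, n_buckets)
        pk = _softmax(K @ R / temp)          # (N, n_buckets)
        # Pool values and key mass per bucket, then read out per query.
        bucket_v = pk.T @ V                  # (n_buckets, d)
        bucket_c = pk.sum(axis=0)            # (n_buckets,)
        out += pq @ bucket_v
        norm += (pq @ bucket_c)[:, None]
    # Each output row is a convex combination of value rows.
    return out / norm
```

Because the pooled weights are nonnegative and normalized, every output row stays inside the convex hull of the value rows, mirroring the averaging behavior of Softmax Attention while touching each token only once per hash table.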