🤖 AI Summary
Softmax-based attention suffers from quadratic time complexity, severely limiting modeling of ultra-long contexts; even optimized implementations like FlashAttention struggle beyond 4M tokens. This paper introduces RACE Attention, an attention mechanism that replaces Softmax with a sharpened angular similarity and combines random projections with soft locality-sensitive hashing (LSH) to achieve linear-time approximate attention. Its linear complexity, in principle, allows scaling to billion-token contexts, and in practice RACE Attention completes a single forward-backward pass over sequences of up to 12M tokens on GPU and 75M tokens on CPU. It significantly reduces memory consumption and latency while maintaining language modeling accuracy competitive with state-of-the-art baselines.
📄 Abstract
Softmax Attention has quadratic time complexity, which becomes prohibitive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.
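To make the idea concrete, here is a minimal, hedged NumPy sketch of the core trick the abstract describes: queries and keys are mapped through shared random projections, softly assigned to hash buckets (a differentiable stand-in for hard LSH bucketing), and values are pooled per bucket, so cost grows linearly in sequence length rather than quadratically. The function names, bucket counts, temperature, and the exact normalization below are illustrative assumptions, not the paper's actual kernel or the sharpened angular similarity it derives.

```python
import numpy as np

def _softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_lsh_attention(Q, K, V, n_hashes=4, n_buckets=16, temp=1.0, seed=0):
    """Illustrative linear-time attention via random projections + soft LSH.

    Single head, non-causal, for intuition only -- NOT the paper's RACE
    Attention implementation. Cost is O(N * n_buckets * d) per hash table,
    versus O(N^2 * d) for exact Softmax Attention.
    """
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    out = np.zeros_like(V)
    norm = np.zeros((N, 1))
    for _ in range(n_hashes):
        # Random projection directions define this table's hash buckets.
        R = rng.standard_normal((d, n_buckets))
        # Soft bucket assignments replace hard LSH so the estimator
        # stays differentiable; rows of pq/pk sum to 1.
        pq = _softmax(Q @ R / temp)          # (N, n_buckets)
        pk = _softmax(K @ R / temp)          # (N, n_buckets)
        # Pool values and key mass per bucket, then read out per query.
        bucket_v = pk.T @ V                  # (n_buckets, d)
        bucket_c = pk.sum(axis=0)            # (n_buckets,)
        out += pq @ bucket_v
        norm += (pq @ bucket_c)[:, None]
    # Each output row is a convex combination of value rows.
    return out / norm
```

Because the pooled weights are nonnegative and normalized, every output row stays inside the convex hull of the value rows, mirroring the averaging behavior of Softmax Attention while touching each token only once per hash table.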