Fast attention mechanisms: a tale of parallelism

πŸ“… 2025-09-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
The quadratic time complexity of standard attention severely limits the scalability of Transformers, even though they are expressive enough to simulate Massively Parallel Computation (MPC) algorithms. Method: The paper proposes ANNA (Approximate Nearest Neighbor Attention), a sub-quadratic approximate attention mechanism. Contribution/Results: ANNA-transformers are proven to retain the expressive power previously established for standard attention in terms of matching MPC algorithms, and to solve key reasoning tasks such as Match2 and k-hop with near-optimal depth. Within the same MPC framework, constant-depth ANNA-transformers are further shown to simulate constant-depth low-rank transformers, giving a unified way to reason about a broad class of efficient attention approximations while improving computational efficiency.

πŸ“ Abstract
Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.
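The Match2 task mentioned in the abstract is, in the version common in this line of work (an assumption here; the paper's exact formulation may differ), a pairwise matching problem: token i is labeled 1 iff some other token j satisfies (x_i + x_j) ≑ 0 mod p. A minimal reference implementation of that labeling:

```python
def match2_labels(xs, p):
    """Match2 labeling (assumed formulation): token i gets label 1 iff
    there exists another token j with (xs[i] + xs[j]) % p == 0."""
    return [int(any((xi + xj) % p == 0 for j, xj in enumerate(xs) if j != i))
            for i, xi in enumerate(xs)]
```

Because the answer for token i depends on a single matching partner anywhere in the sequence, the task stresses exactly the kind of all-pairs interaction that dense attention handles in one layer and that efficient approximations must not lose.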
Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic complexity in transformer attention mechanisms
Maintaining expressive power with efficient attention approximation
Enabling scalable simulation of massively parallel computation algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximate Nearest Neighbor Attention mechanism
Sub-quadratic time complexity design
Simulates constant-depth low-rank transformers
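The source does not spell out ANNA's construction, but the name suggests attention restricted to approximate nearest neighbors of each query. A minimal illustrative sketch, assuming random-hyperplane LSH bucketing (all names and the bucketing scheme are hypothetical, not the paper's method): each query attends only to keys hashed into the same bucket, so with small, balanced buckets the full n Γ— n score matrix is never formed.

```python
import numpy as np

def lsh_bucket_attention(Q, K, V, n_planes=4, seed=0):
    """Toy ANN-style attention: hash queries and keys with random
    hyperplanes; each query softmax-attends only within its bucket.
    Illustrative sketch only, not the paper's ANNA construction."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    planes = rng.standard_normal((d, n_planes))
    # Bucket id = bit pattern of signs against the random hyperplanes.
    bits = 1 << np.arange(n_planes)
    q_codes = ((Q @ planes) > 0).astype(int) @ bits
    k_codes = ((K @ planes) > 0).astype(int) @ bits
    out = np.zeros_like(V)
    for i in range(n):
        mask = k_codes == q_codes[i]
        if not mask.any():
            continue  # query's bucket holds no keys: emit zeros
        scores = (K[mask] @ Q[i]) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[mask]
    return out
```

High-inner-product query/key pairs tend to land in the same bucket, which is the intuition behind nearest-neighbor attention: most of the probability mass of softmax attention sits on the few keys closest to the query.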