🤖 AI Summary
Although Transformers are expressive enough to simulate massively parallel computation (MPC) algorithms, the quadratic time complexity of standard attention severely limits their scalability on long sequences and deep reasoning tasks.
Method: We propose ANNA (Approximate Nearest Neighbor Attention), a sub-quadratic approximate attention mechanism that preserves the expressive power of standard attention while still supporting the simulation of MPC algorithms.
Contribution/Results: We theoretically prove that ANNA-Transformers retain the expressive power previously established for standard attention, matching the capabilities of MPC algorithms, and that constant-depth ANNA-Transformers can simulate constant-depth low-rank Transformers. This gives a unified, MPC-based framework for reasoning about the expressivity and depth efficiency of a broad class of efficient attention approximations. We further show that ANNA-Transformers solve key reasoning tasks such as Match2 and k-hop with near-optimal depth, while reducing attention cost from quadratic to sub-quadratic time.
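The summary above does not spell out how ANNA achieves sub-quadratic cost. As intuition only, the sketch below shows one generic way an approximate-nearest-neighbor-style attention can run in sub-quadratic time: hash queries and keys into buckets with a shared random sign projection and restrict softmax attention to within-bucket pairs. The function name `bucketed_attention`, the sign-based hashing, and the bucket count are illustrative assumptions, not the ANNA construction from the paper.

```python
import numpy as np

def bucketed_attention(Q, K, V, n_buckets=16, seed=0):
    """Toy sub-quadratic attention (illustrative, NOT the paper's ANNA):
    hash tokens into buckets via a random sign projection and run softmax
    attention only within each bucket."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    n_bits = int(np.log2(n_buckets))
    planes = rng.normal(size=(d, n_bits))             # LSH hyperplanes shared by Q and K

    def bucket_ids(X):
        bits = (X @ planes) > 0                       # (n, n_bits) sign pattern
        return bits.astype(int) @ (1 << np.arange(n_bits))  # pack bits into a bucket id

    q_ids, k_ids = bucket_ids(Q), bucket_ids(K)
    out = np.zeros_like(V)
    for b in range(n_buckets):
        qi = np.where(q_ids == b)[0]
        ki = np.where(k_ids == b)[0]
        if qi.size == 0 or ki.size == 0:
            continue                                  # queries with no matching keys stay zero
        scores = Q[qi] @ K[ki].T / np.sqrt(d)         # only within-bucket pairs are scored
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[qi] = weights @ V[ki]
    return out
```

In this toy setup, with roughly sqrt(n) buckets of roughly sqrt(n) tokens each, the within-bucket score computation costs on the order of n^1.5 * d operations instead of the n^2 * d cost of dense attention.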
📄 Abstract
Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.
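For concreteness, here is a small sketch of the two reasoning tasks named in the abstract, assuming the definitions commonly used in prior work on transformer depth: Match2 asks, at each position, whether some other position sums with it to 0 modulo M, and k-hop repeatedly jumps to the position just after the most recent earlier occurrence of the current token. The paper's exact task setups may differ in details.

```python
def match2(xs, M):
    """Match2 (assumed standard definition): output 1 at position i iff some
    other position j satisfies x_i + x_j = 0 (mod M)."""
    return [int(any((xi + xj) % M == 0 for j, xj in enumerate(xs) if j != i))
            for i, xi in enumerate(xs)]

def k_hop(tokens, k):
    """k-hop (assumed induction-heads-style definition): from each position,
    repeatedly jump to the position just after the most recent earlier
    occurrence of the current token; return the token reached after k hops
    (None if any hop is undefined)."""
    def hop(i):
        for j in range(i - 1, -1, -1):
            if tokens[j] == tokens[i]:
                return j + 1
        return None

    answers = []
    for i in range(len(tokens)):
        pos = i
        for _ in range(k):
            pos = hop(pos)
            if pos is None:
                break
        answers.append(tokens[pos] if pos is not None else None)
    return answers
```

Solving k-hop sequentially takes k dependent lookups, which is why how few attention layers (i.e., how little depth) an architecture needs for such tasks is the interesting quantity behind the near-optimal-depth claim.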