🤖 AI Summary
This work investigates whether multi-head, multi-layer transformers can be computed more efficiently than the standard approach of evaluating each attention head independently. Combining the Strong Exponential Time Hypothesis (SETH), the matrix multiplication exponent $\omega$, the Baur–Strassen theorem, and techniques from arithmetic circuit complexity, the paper establishes the first non-trivial computational lower bounds for such architectures. It shows that when the embedding dimension is small, any algorithm requires essentially $LHN^{2+o(1)}$ time, which is nearly optimal; when the embedding dimension is large, at least $LHN^{\omega-o(1)}$ arithmetic operations are necessary. Together these bounds imply that the conventional independent-head computation is asymptotically near-optimal, revealing fundamental computational limits inherent in the transformer attention mechanism.
📝 Abstract
The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism multiplies three $N \times m$ matrices and applies softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention.
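The mechanism just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's formalization: learned projections and the customary $1/\sqrt{m}$ scaling are omitted, and `attention` is a name chosen here for exposition.

```python
import numpy as np

def attention(Q, K, V):
    """Single attention head on N x m inputs: row-wise softmax(Q K^T) V."""
    S = Q @ K.T                              # N x N score matrix
    S = S - S.max(axis=1, keepdims=True)     # shift for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)     # row-wise softmax
    return P @ V                             # N x m output

# Tiny example: N = 4 tokens of dimension m = 3.
N, m = 4, 3
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, m)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 3)
```

The dominant cost is the two $N \times N$ matrix products, i.e. $O(N^2 m)$ scalar operations per head.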
Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention.
In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2 + o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{\omega + o(1)}$ arithmetic operations (plus exponentiations), where $\omega$ is the matrix multiplication exponent. We establish that this is optimal by showing that $LHN^{\omega - o(1)}$ arithmetic operations are necessary when $\omega > 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur–Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.
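The baseline against which these lower bounds are measured is the independent-head evaluation, which can be sketched as follows. This is a simplifying illustration, not the paper's model: per-head $m \times m$ projections `Wq`, `Wk`, `Wv` are invented names, heads are summed (rather than concatenated and re-projected) so each layer maps $N \times m$ to $N \times m$, and feed-forward blocks are omitted. With $L$ layers of $H$ heads, each costing $O(N^2 m)$ work, the total matches the $LHN^{2+o(1)}$ baseline for small $m$.

```python
import numpy as np

def naive_transformer(X, Wq, Wk, Wv, L, H):
    """Evaluate all L*H attention heads independently (the baseline).

    X is N x m; Wq[l][h], Wk[l][h], Wv[l][h] are m x m per-head
    projections. Heads are summed so layer output stays N x m
    (a simplification standing in for concatenation + projection).
    """
    for l in range(L):
        out = np.zeros_like(X)
        for h in range(H):                       # H independent heads
            Q, K, V = X @ Wq[l][h], X @ Wk[l][h], X @ Wv[l][h]
            S = Q @ K.T
            S = np.exp(S - S.max(axis=1, keepdims=True))
            out += (S / S.sum(axis=1, keepdims=True)) @ V
        X = out                                  # feed combined output forward
    return X

# Tiny example: L = 2 layers, H = 3 heads, N = 5 tokens, m = 4.
rng = np.random.default_rng(1)
L, H, N, m = 2, 3, 5, 4
X = rng.standard_normal((N, m))
Wq, Wk, Wv = ([[rng.standard_normal((m, m)) for _ in range(H)]
               for _ in range(L)] for _ in range(3))
print(naive_transformer(X, Wq, Wk, Wv, L, H).shape)  # (5, 4)
```

The paper's results say that, up to lower-order terms, no algorithm can beat this head-by-head loop in either embedding regime.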