🤖 AI Summary
The Transformer’s self-attention mechanism suffers from $O(n^2)$ time and memory complexity, severely limiting efficiency in long-sequence modeling. This work systematically evaluates sub-quadratic alternatives—including sparse attention, linear attention, state space models (SSMs), recurrent architectures, and hybrid designs—across theoretical complexity, empirical throughput, memory footprint, and downstream task performance. Through rigorous benchmarking on Long Range Arena, PG19, and WikiText, coupled with theoretical analysis, we find that several non-attention architectures significantly outperform standard Transformers on long-context tasks, achieving superior scalability and modeling capability. Our results empirically demonstrate that self-attention is not indispensable for effective sequence modeling, challenging its architectural primacy. The study provides concrete evidence and a principled roadmap for developing next-generation efficient sequence models, bridging theoretical advances with practical system-level gains.
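To make the complexity gap concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) contrasting standard softmax attention, which materializes an $n \times n$ score matrix, with a kernelized linear-attention variant that computes $\phi(Q)\,(\phi(K)^\top V)$ right-to-left so the $n \times n$ matrix never exists. The feature map `phi` is an assumed placeholder; real linear-attention methods differ in their choice of kernel.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes an (n, n) score matrix -> O(n^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized linear attention: compute phi(K)^T V first, a (d, d) summary,
    # then multiply by phi(Q) -> O(n d^2) time, O(d^2) state, no (n, n) matrix.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d) summary of keys and values
    Z = Qp @ Kp.sum(axis=0)         # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)  # both (8, 4); values differ, as the kernels differ
```

The two outputs are not numerically equal—linear attention approximates softmax attention with a different kernel—but both produce one $d$-dimensional output per query, which is what lets the quadratic term be traded away.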
📝 Abstract
Transformers have dominated sequence-processing tasks -- most notably language modeling -- for the past seven years. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention Transformers may soon be challenged.