The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

📅 2025-10-06
🤖 AI Summary
The Transformer's self-attention mechanism incurs $O(n^2)$ time and memory complexity, a major bottleneck for long-sequence modeling. This survey examines sub-quadratic alternatives, including sparse attention, linear attention, state space models (SSMs), recurrent architectures, and hybrid designs, comparing them on theoretical complexity, reported throughput, memory footprint, and downstream task performance. Drawing on published benchmark results (e.g., Long Range Arena, PG19, and WikiText) together with complexity analysis, the authors find that several non-attention architectures match or exceed standard Transformers on long-context tasks, suggesting that self-attention is not indispensable for effective sequence modeling and challenging its architectural primacy. The survey thereby offers a principled roadmap for next-generation efficient sequence models, connecting theoretical advances with practical system-level gains.
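As a rough illustration of the complexity gap the summary describes (this is a generic NumPy sketch, not code from the paper), standard softmax attention materializes an $n \times n$ score matrix, while kernelized linear attention exploits associativity to compute a $d \times d$ summary first, reducing the cost from $O(n^2 d)$ to $O(n d^2)$. The feature map `phi` below is a simple positive ReLU-based choice assumed for the example:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds an (n, n) score matrix -> O(n^2) time/memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: computing K^T V first gives a (d, d_v) summary,
    so the total cost is O(n d^2), linear in sequence length n."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # (d, d_v) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)          # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)      # same (n, d) output shape as softmax attention
```

The two functions are not numerically equivalent (linear attention approximates the softmax kernel), which is exactly the quality/efficiency trade-off the surveyed methods navigate.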

📝 Abstract
Transformers have dominated sequence processing tasks for the past seven years -- most notably language modeling. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged.
Problem

Research questions and friction points this paper is trying to address.

Overcoming quadratic complexity bottleneck in transformer attention mechanisms
Surveying sub-quadratic alternatives to traditional transformer architectures
Assessing potential challengers to pure-attention transformer dominance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveys sub-quadratic attention variants
Reviews recurrent neural network architectures for sequence modeling
Analyzes state space models and hybrid attention-recurrent designs
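To make the state-space-model entry above concrete (a minimal generic sketch, not the paper's implementation), a diagonal linear SSM processes a sequence recurrently with $O(d)$ work per step, so a length-$n$ sequence costs $O(nd)$ rather than $O(n^2)$; the parameter values below are illustrative assumptions:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Diagonal linear SSM: x_t = A * x_{t-1} + B * u_t, y_t = C . x_t.
    Elementwise (diagonal) transition -> O(d) per step, O(n d) total."""
    x = np.zeros_like(A)
    ys = []
    for u_t in u:
        x = A * x + B * u_t          # diagonal state update
        ys.append(C @ x)             # scalar readout per step
    return np.array(ys)

A = np.full(8, 0.9)                  # stable diagonal transition (|A| < 1)
B = np.ones(8)
C = np.ones(8) / 8
y = ssm_scan(np.ones(16), A, B, C)   # one output per input step
```

Because the recurrence is linear, such models can also be trained in parallel via convolutions or associative scans, which is what makes them competitive with attention at long context lengths.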
Alexander M. Fichtl
Social Computing Group, Technical University of Munich
Jeremias Bohn
Social Computing Group, Technical University of Munich
Josefin Kelber
Social Computing Group, Technical University of Munich
Edoardo Mosca
Social Computing Group, Technical University of Munich
Georg Groh
Adjunct Professor
Social Computing, Natural Language Processing