🤖 AI Summary
The Transformer’s self-attention mechanism suffers from $O(n^2)$ time and memory complexity, severely limiting efficiency in long-sequence modeling. This work systematically evaluates sub-quadratic alternatives—including sparse attention, linear attention, state space models (SSMs), recurrent architectures, and hybrid designs—across theoretical complexity, empirical throughput, memory footprint, and downstream task performance. Through rigorous benchmarking on Long Range Arena, PG19, and WikiText, coupled with theoretical analysis, we find that several non-attention architectures significantly outperform standard Transformers on long-context tasks, achieving superior scalability and modeling capability. Our results empirically demonstrate that self-attention is not indispensable for effective sequence modeling, challenging its architectural primacy. The study provides concrete evidence and a principled roadmap for developing next-generation efficient sequence models, bridging theoretical advances with practical system-level gains.
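To make the complexity gap concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) contrasting standard softmax attention, which materializes an $n \times n$ score matrix, with a kernelized linear-attention variant that computes $\phi(Q)\,(\phi(K)^\top V)$ right-to-left so the $n \times n$ matrix never exists. The feature map `phi` is an assumed placeholder; real linear-attention methods differ in their choice of kernel.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes an (n, n) score matrix -> O(n^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized linear attention: compute phi(K)^T V first, a (d, d) summary,
    # then multiply by phi(Q) -> O(n d^2) time, O(d^2) state, no (n, n) matrix.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d) summary of keys and values
    Z = Qp @ Kp.sum(axis=0)         # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)  # both (8, 4); values differ, as the kernels differ
```

The two outputs are not numerically equal—linear attention approximates softmax attention with a different kernel—but both produce one $d$-dimensional output per query, which is what lets the quadratic term be traded away.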
📝 Abstract
Transformers have dominated sequence-processing tasks -- most notably language modeling -- for the past seven years. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention Transformers may soon be challenged.