A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention

📅 2026-02-02
🤖 AI Summary
This work provides the first rigorous theoretical characterization of the gap in expressive power among linear attention, hybrid attention, and standard full attention mechanisms. By constructing a multi-step reasoning task (sequential function composition), we prove that full attention networks are strictly more powerful than hybrid architectures: even when equipped with exponentially many additional linear attention layers, a hybrid model cannot solve tasks that a full attention network with just $L+1$ layers can accomplish. We establish complexity lower bounds for sequential function composition under recurrent variants of linear attention, such as those employed in Mamba and DeltaNet, thereby revealing limitations inherent to linear attention. Our analysis delineates a clear hierarchy of representational capacity, positioning full attention strictly above hybrid models, which in turn strictly dominate purely linear attention mechanisms.
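To make the contrast concrete, here is a minimal sketch (an illustration, not the paper's exact formulation) of why linear attention is a recurrence: full attention recomputes weights over the entire prefix at every step, whereas linear attention maintains a fixed-size recurrent state updated additively, which is the structural property the lower bounds exploit.

```python
import numpy as np

def full_attention(Q, K, V):
    """Causal softmax attention: each step attends over the whole
    prefix, so the per-step computation grows with sequence length."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        scores = Q[t] @ K[:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ V[:t + 1]
    return out

def linear_attention(Q, K, V):
    """Linear attention written as a recurrence: a fixed d x d state S
    summarizes the prefix, so memory is constant in sequence length."""
    T, d = Q.shape
    S = np.zeros((d, d))               # constant-size recurrent state
    out = np.zeros_like(V)
    for t in range(T):
        S = S + np.outer(K[t], V[t])   # rank-1 additive state update
        out[t] = Q[t] @ S
    return out
```

Recurrent variants such as Mamba or DeltaNet replace the simple additive update with gated or delta-rule updates, but they share the constant-size-state structure shown here.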

📝 Abstract
Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba and DeltaNet. Specifically, we establish an expressiveness hierarchy for sequential function composition, a multi-step reasoning task that must be completed within a model's forward pass: an ($L+1$)-layer full attention network is sufficient, whereas any hybrid network interleaving $L-1$ layers of full attention with a substantially larger number ($2^{3L^2}$) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.
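The separating task described above can be illustrated with a toy instance (a hypothetical example for intuition only; the paper's formal construction differs): given a chain of $L$ functions over a small domain, the model must output $f_L(\dots f_2(f_1(x))\dots)$ in a single forward pass, and each application depends on the previous result, which is what makes depth, rather than width, the binding resource.

```python
import random

def compose_task(fns, x):
    """Hypothetical sequential function composition instance: apply
    each function (stored as a lookup table) in order; every step
    depends on the output of the previous one."""
    for f in fns:
        x = f[x]
    return x

random.seed(0)
domain = list(range(8))
# Each f_i is a random permutation of the domain, stored as a list.
fns = [random.sample(domain, len(domain)) for _ in range(3)]  # L = 3
answer = compose_task(fns, x=2)
```

Because the $i$-th lookup cannot begin until the $(i-1)$-th finishes, a model must realize $L$ dependent steps inside its forward pass; the paper's lower bound shows that linear attention layers, however numerous, cannot substitute for the missing full attention depth.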
Problem

Research questions and friction points this paper is trying to address.

expressiveness hierarchy
hybrid attention
linear attention
full attention
theoretical characterization
Innovation

Methods, ideas, or system contributions that make the work stand out.

expressiveness hierarchy
hybrid attention
linear attention
full attention
theoretical separation