Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the trade-off between expressivity and computational efficiency in sequence modeling and proposes a hybrid architecture that combines Transformer and state space model (SSM) layers. Through theoretical analysis, it establishes that pure Transformers and pure SSMs suffer from fundamental limitations on certain tasks: any such model that solves them requires either a large number of parameters or a large working memory. To overcome this bottleneck, the authors construct provably effective hybrid models. Experiments demonstrate that small hybrid models outperform non-hybrid counterparts with up to six times more parameters on tasks such as selective copying and associative recall, while consuming significantly less memory and exhibiting stronger length generalization and out-of-distribution robustness.

📝 Abstract
Hybrid sequence models, which combine Transformer and state-space model layers, seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where, and the underlying mechanisms through which, they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family, namely selective copying and associative recall, we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned, rather than constructed, hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.
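The two prototypical tasks named in the abstract, selective copying and associative recall, are standard synthetic benchmarks in the sequence-modeling literature. A minimal sketch of how instances of these tasks are typically generated follows; the function names, the blank-token convention, and the key/value vocabulary layout are illustrative assumptions, not the paper's exact construction:

```python
import random

def make_selective_copy(seq_len, n_data, vocab, blank=0):
    """Selective copying: a sequence of mostly blank tokens with n_data
    content tokens scattered at random positions; the target is those
    content tokens in their original order."""
    positions = sorted(random.sample(range(seq_len), n_data))
    seq = [blank] * seq_len
    target = []
    for pos in positions:
        tok = random.choice(vocab)
        seq[pos] = tok
        target.append(tok)
    return seq, target

def make_associative_recall(n_pairs, keys, values):
    """Associative recall: a sequence of key-value pairs followed by a
    query key; the target is the value paired with that key."""
    ks = random.sample(keys, n_pairs)          # distinct keys
    mapping = {k: random.choice(values) for k in ks}
    seq = []
    for k in ks:
        seq += [k, mapping[k]]
    query = random.choice(ks)
    seq.append(query)
    return seq, mapping[query]
```

Solving selective copying requires remembering which positions held content (where attention excels), while associative recall over long contexts stresses a model's working memory, which is the regime where the paper's lower bounds for non-hybrid models bite.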
Problem

Research questions and friction points this paper is trying to address.

hybrid sequence models
expressivity-efficiency tradeoff
Transformers
state-space models
length generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid sequence models
expressivity-efficiency tradeoff
state-space models
Transformers
length generalization