Interdomain Attention: Beyond Token-Level Key-Value Memory

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a trade-off between two families of sequence models. Standard Transformer attention matches queries to keys by content, but at quadratic cost over a growing key-value (KV) cache. State space models (SSMs) compress context into a fixed-size state that scales well with sequence length, but that state is not addressed by query-key matching. The authors propose Interdomain Attention, an architecture that embeds an SSM into an attention module via kernel methods: the attention kernel is approximated with a finite feature map, the key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query attends to the compressed coefficients through its own feature map. This recovers query-conditioned attention over a fixed-size state. In autoregressive language models from 125M to 1.3B parameters trained on FineWeb-Edu, the method improves on an SSM token mixer at every scale under a matched recurrent-state budget. At 1.3B it surpasses a same-recipe softmax attention baseline on validation perplexity and on an eight-task commonsense suite, and its performance stays flat with length when evaluated on contexts up to 3.5× the training context.

📝 Abstract

Transformers and deep state space models (SSMs) sit at opposite ends of a basic design choice: attention routes each query through a growing key-value (KV) cache by content-based matching at quadratic cost, while deep SSMs compress context into a fixed-size recurrent state that is not directly addressed by query-key matching. We propose Interdomain Attention, which integrates an SSM into an attention module through kernel methods: an attention kernel is approximated by a finite feature map, the resulting key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query attends to the compressed coefficients through its own feature map, recovering query-conditioned attention over a fixed-size state. The scalable layer is a learned relaxation of this derivation, and we validate its components through ablations. In a 125M to 1.3B autoregressive language-modeling study on FineWeb-Edu at matched recurrent-state budget, Interdomain Attention improves on an SSM token mixer at every scale, surpasses a same-recipe softmax baseline at 1.3B on validation perplexity and on the eight-task commonsense suite, and inherits the length-flat behavior of its fixed-state core out to 3.5x the training context. Ablations indicate that the query-conditioned projection is the main source of the gain.
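The core idea in the abstract, a query reading from a fixed-size state through a feature map, can be illustrated with a minimal sketch. This is generic kernelized (linear) attention, not the paper's layer: the plain running sums stand in for the SSM-maintained basis functions, and the feature map `elu(x) + 1`, the function names, and the toy sizes are all assumptions chosen for illustration. The sketch checks that the fixed-state recurrence matches the explicit causal form that revisits every past key.

```python
import math
import random

def feature_map(x):
    """Positive feature map elu(x) + 1, elementwise (an assumed choice, not the paper's)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def recurrent_attention(queries, keys, values):
    """Causal kernelized attention computed from a fixed-size state (S, z)."""
    m, d_v = len(feature_map(keys[0])), len(values[0])
    S = [[0.0] * d_v for _ in range(m)]  # running sum of phi(k) v^T, shape m x d_v
    z = [0.0] * m                        # running sum of phi(k), used for normalization
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(m):               # fold the new key/value into the state
            z[i] += phi_k[i]
            for j in range(d_v):
                S[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)           # the query reads the state via its own features
        norm = dot(phi_q, z)
        outputs.append(
            [dot(phi_q, [S[i][j] for i in range(m)]) / norm for j in range(d_v)]
        )
    return outputs

def quadratic_attention(queries, keys, values):
    """Same kernel attention, computed explicitly over all past keys (growing cost)."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [dot(phi_q, feature_map(keys[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs

random.seed(0)
T, d = 6, 4  # toy sequence length and head dimension (hypothetical values)
queries, keys, values = (
    [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(3)
)

out_rec = recurrent_attention(queries, keys, values)
out_quad = quadratic_attention(queries, keys, values)
max_diff = max(
    abs(a - b) for row_r, row_q in zip(out_rec, out_quad) for a, b in zip(row_r, row_q)
)
print("max difference between the two forms:", max_diff)
```

In this sketch the state `(S, z)` has size `m * d_v + m` regardless of sequence length, which is the property that makes the recurrent form length-independent. The paper replaces the plain running sums with an SSM recurrence and a learned relaxation, neither of which is modeled here.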

Problem

Research questions and friction points this paper is trying to address.

Interdomain Attention

Transformers

State Space Models

Attention Mechanism

Fixed-Size State

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interdomain Attention

State Space Models

Kernel Methods