Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

📅 2026-05-27
🤖 AI Summary
Linear recurrent models still lag on long-context retrieval and in-context learning tasks, while softmax attention's compute grows quadratically and its memory linearly with sequence length. This work proposes Oryx, a hybrid architecture that can switch between attention and a linear recurrence (Mamba-2 or Gated DeltaNet in the reported variants) as the token mixer at different points in a sequence, tying at least 90% of its parameters across mixers so that both modes operate over shared internal representations. Unlike prior hybrids that statically interleave or merge attention and recurrent blocks, Oryx hybridizes along the sequence axis. At the 1.4B-parameter scale, all Oryx instances outperform their respective single-mixer baselines by at least 0.7 percentage points on averaged language modeling tasks, and on retrieval tasks Oryx is comparable to the Transformer baseline while processing fewer than 10% of tokens in attention mode.
📝 Abstract
Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis for developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants at scales up to 1.4B parameters. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
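As a rough illustration of the scaling claims in the abstract, the sketch below counts idealized costs for a sequence of n tokens. It is a back-of-the-envelope model, not a measurement from the paper: the function names `attention_cost` and `recurrent_cost` are hypothetical, and each pairwise score, cached key/value entry, state update, and state is counted as one unit, ignoring hidden size and constant factors.

```python
def attention_cost(n):
    """Idealized causal softmax attention over n tokens (assumed unit costs)."""
    # Token t attends to tokens 1..t, so the pairwise scores total n(n+1)/2:
    # compute grows quadratically with sequence length.
    score_ops = n * (n + 1) // 2
    # One key/value entry is cached per token: memory grows linearly.
    cache_entries = n
    return score_ops, cache_entries


def recurrent_cost(n):
    """Idealized linear recurrence over n tokens (assumed unit costs)."""
    # One state update per token: compute grows linearly.
    state_updates = n
    # A single fixed-size state is carried forward: memory is constant.
    state_entries = 1
    return state_updates, state_entries


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, attention_cost(n), recurrent_cost(n))
```

Under these assumptions, a tenfold increase in n raises the attention score count by roughly a hundredfold, while the recurrent update count rises only tenfold and its state size does not change.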
Problem

Research questions and friction points this paper is trying to address.

linear recurrent models
long-context retrieval
in-context learning
token mixing
sequence modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid sequence modeling
dynamic mixer switching
shared representations
linear recurrent models
efficient attention
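The ideas of mixer switching over shared representations can be pictured with a toy single-head sketch. This is not the Oryx implementation. It assumes that both mixers read the same query, key, and value projections (`Wq`, `Wk`, `Wv`), it uses plain ungated linear attention as a stand-in for Mamba-2 or Gated DeltaNet, and it updates both the key/value cache and the recurrent state at every token so that either mode can be chosen at any position. How Oryx actually hands off state between modes is not described in this text; the names `run`, `attention_step`, and `recurrent_step` are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_step(q, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dv)]


def recurrent_step(q, state):
    """Linear-attention readout q^T S from the running state S."""
    dv = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(dv)]


def run(xs, modes, Wq, Wk, Wv):
    """Process tokens xs, choosing the mixer for each token from modes."""
    dk, dv = len(Wk), len(Wv)
    keys, values = [], []
    state = [[0.0] * dv for _ in range(dk)]  # S accumulates k v^T
    outputs = []
    for x, mode in zip(xs, modes):
        # Shared projections: both mixers see the same q, k, v (assumption).
        q, k, v = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)
        keys.append(k)
        values.append(v)
        for i in range(dk):
            for j in range(dv):
                state[i][j] += k[i] * v[j]
        if mode == "attention":
            outputs.append(attention_step(q, keys, values))
        else:
            outputs.append(recurrent_step(q, state))
    return outputs


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(run(tokens, ["recurrent", "attention"], identity, identity, identity))
```

Keeping both the cache and the state current at every step is the simplest way to make the toy switchable, but it gives up the memory savings of the recurrent mode; a practical design would need a cheaper way to move between modes.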