Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing linear recurrent models lag on long-context retrieval and in-context learning tasks, while standard attention incurs compute that grows quadratically and memory that grows linearly with sequence length. This work proposes Oryx, a hybrid architecture that switches between attention and a linear recurrence (e.g., Mamba-2, Gated DeltaNet) as the token mixer within a single sequence, tying at least 90% of its parameters across mixers so that both modes operate over shared internal representations. Unlike hybrids that statically interleave or merge attention and recurrent blocks, Oryx hybridizes along the token sequence. At the 1.4B-parameter scale, every Oryx variant outperforms its single-mixer baseline by at least 0.7 percentage points on averaged language modeling tasks; on retrieval tasks, it performs comparably to the Transformer baseline while processing fewer than 10% of tokens in attention mode.

📝 Abstract

Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
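The idea of two mixers operating over tied parameters can be illustrated with a small NumPy sketch. It is not the authors' implementation: the class name `SharedMixerLayer`, the choice to tie the query, key, value, and output projections, and the fixed scalar `decay` are all assumptions made for illustration, and the decayed linear-attention recurrence is a simple stand-in for Mamba-2 or Gated DeltaNet. The sketch applies one mode per call and does not model how Oryx switches modes partway through a sequence.

```python
import numpy as np


def causal_softmax_attention(q, k, v):
    """Quadratic mixer: every token attends to itself and all earlier tokens."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) score matrix
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future tokens
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_recurrence(q, k, v, decay=0.9):
    """Linear mixer: a fixed-size state is updated once per token."""
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))                   # size independent of T
    out = np.empty_like(v)
    for t in range(T):
        state = decay * state + np.outer(k[t], v[t])    # hypothetical scalar decay
        out[t] = q[t] @ state
    return out


class SharedMixerLayer:
    """Hypothetical layer whose projections are shared by both mixers."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.Wq, self.Wk, self.Wv, self.Wo = (
            rng.normal(0.0, scale, (d_model, d_model)) for _ in range(4)
        )

    def __call__(self, x, mode):
        # The same weights feed whichever mixer is selected.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        if mode == "attention":
            mixed = causal_softmax_attention(q, k, v)
        elif mode == "recurrent":
            mixed = linear_recurrence(q, k, v)
        else:
            raise ValueError(f"unknown mode: {mode}")
        return mixed @ self.Wo
```

Calling `SharedMixerLayer(8)(x, "attention")` and `SharedMixerLayer(8)(x, "recurrent")` on the same `(T, 8)` input returns outputs of the same shape from the same weights. The attention path builds a `T × T` score matrix, while the recurrent path carries only a `d × d` state, which mirrors the quadratic versus linear compute trade-off the abstract describes.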

Problem

Research questions and friction points this paper is trying to address.

linear recurrent models

long-context retrieval

in-context learning

token mixing

sequence modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid sequence modeling

dynamic mixer switching

shared representations