MoM: Linear Sequence Modeling with Mixture-of-Memories

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing linear sequence models, including linear attention, state space models (SSMs), and linear RNNs, achieve O(L) training and O(1) inference complexity but compress the entire input into a single fixed-size memory state, leading to long-range information loss and poor performance on recall-intensive language tasks. To address this, the paper proposes Mixture-of-Memories (MoM), an architecture that draws on neuroscientific accounts of interference-resistant long-term memory. MoM employs a learnable router network to dispatch tokens to multiple parallel, independently updated memory states, with sparse activation for computational efficiency. Crucially, it preserves linear-time training and constant-time inference while substantially increasing memory capacity and reducing memory interference. Experiments show that MoM consistently outperforms state-of-the-art linear models on recall-intensive language tasks, approaching Transformer performance, without sacrificing inference speed or scalability.

📝 Abstract
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
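The mechanism described in the abstract, a router that sparsely dispatches each token to a few independent memory states, each updated with a linear-attention-style outer-product rule, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (`MoM`, `route`, `step`), the top-k routing rule, and the additive update `M ← M + k vᵀ` are illustrative simplifications.

```python
# Hypothetical sketch of the Mixture-of-Memories (MoM) idea: a router
# selects a sparse subset of independent memory matrices per token, each
# updated with a linear-attention-style outer-product rule. Illustrative
# only; the real model uses learned projections and gated updates.

def outer(u, v):
    # rank-1 update term k v^T
    return [[ui * vj for vj in v] for ui in u]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def vec_mat(q, m):
    # readout q^T M from a memory matrix
    return [sum(q[i] * m[i][j] for i in range(len(q))) for j in range(len(m[0]))]

class MoM:
    def __init__(self, d, num_memories, top_k):
        self.d = d
        self.top_k = top_k
        # each memory is an independent d x d state, initialized to zero
        self.memories = [[[0.0] * d for _ in range(d)]
                         for _ in range(num_memories)]

    def route(self, scores):
        # sparse activation: keep only the top-k memories by router score
        return sorted(range(len(scores)), key=lambda i: -scores[i])[:self.top_k]

    def step(self, q, k, v, scores):
        # one token: update the selected memories, then read out a
        # score-weighted combination. Cost per token is constant in
        # sequence length, preserving the linear-complexity property.
        chosen = self.route(scores)
        out = [0.0] * self.d
        for idx in chosen:
            self.memories[idx] = mat_add(self.memories[idx], outer(k, v))
            readout = vec_mat(q, self.memories[idx])
            out = [o + scores[idx] * r for o, r in zip(out, readout)]
        return out
```

Because unselected memories are untouched, tokens routed to different memories cannot overwrite each other's state, which is the interference-reduction argument made in the abstract.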
Problem

Research questions and friction points this paper is trying to address.

Limited memory capacity of single-state linear sequence models.
Memory interference that degrades performance on recall-intensive tasks.
Preserving linear training and constant inference complexity while scaling memory.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple independent memory states
Router network directs input tokens
Linear-complexity during training
Jusen Du
Shanghai AI Laboratory, Nanjing University
Weigao Sun
Research Scientist, Shanghai AI Laboratory
LLM, Deep Learning, Optimization
Disen Lan
Ph.D. student, Fudan University
Large Language Model, Efficient Deep Learning
Jiaxi Hu
The Hong Kong University of Science and Technology (Guangzhou)
Yu Cheng
The Chinese University of Hong Kong