Mamba Modulation: On the Length Generalization of Mamba

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Mamba exhibits significant performance degradation on sequences exceeding its pretraining context length, revealing sensitivity to context extension. This work establishes, for the first time, a theoretical connection between state convergence behavior and the spectral properties of the state transition matrix $\mathbf{A}$, identifying out-of-distribution degradation as stemming from unbounded growth of $\mathbf{A}$'s spectral radius with input length. To address this, we propose Spectral Scaling: a post-training mechanism that selectively scales the eigenvalues of each layer's $\mathbf{A}$ matrix to stabilize state evolution while preserving pretrained knowledge, requiring no fine-tuning or reparameterization. Evaluated across multiple long-sequence tasks, the method enables robust generalization up to 2×–4× beyond training length, substantially outperforming baselines such as $\Delta_t$ modulation. Our approach provides an interpretable, efficient, and plug-and-play solution for context extension in state space models.
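The summary describes Spectral Scaling only at a high level. As a rough illustration of the underlying intuition (not the paper's actual per-layer rule), the sketch below shows how an eigenvalue of a diagonal $\mathbf{A}$ that sits very close to zero yields a recurrent state that keeps accumulating as the sequence grows, and how a hypothetical scaling knob `gamma` bounds it. The names `state_norm`, `gamma`, and the fixed step `dt` are assumptions for this sketch.

```python
import numpy as np

# Illustrative diagonal state-transition eigenvalues (Mamba-style A is
# diagonal with negative real parts). An eigenvalue very close to zero
# discretizes to |exp(dt * a)| ~ 1, so that state component decays very
# slowly and keeps accumulating at context lengths beyond pretraining.
a = np.array([-2.0, -0.5, -1e-3])
dt = 0.1                      # fixed discretization step, for illustration
A_bar = np.exp(dt * a)        # discrete-time eigenvalues, each in (0, 1)

def state_norm(eigs, length):
    """Per-eigendirection state of h_t = eigs * h_{t-1} + 1 after `length` steps."""
    h = np.zeros_like(eigs)
    for _ in range(length):
        h = eigs * h + 1.0
    return h

h_raw = state_norm(A_bar, 2000)

# A hypothetical spectral scaling pushes the slow eigenvalue further from
# zero, bounding how large the state can grow (gamma is an assumed knob,
# not the paper's fitted rule):
gamma = 10.0
h_scaled = state_norm(np.exp(dt * gamma * a), 2000)

print(h_raw, h_scaled)  # the slow mode's state shrinks markedly after scaling
```

Closed-form check: with a constant unit input, each component converges to the geometric-series limit $1/(1 - \bar{a})$, which blows up as the discrete eigenvalue $\bar{a}$ approaches 1; scaling the spectrum away from zero caps that limit.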

📝 Abstract
The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N \Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
Problem

Research questions and friction points this paper is trying to address.

Mamba's performance deteriorates on contexts longer than those seen during pre-training
State-space dynamics exhibit out-of-distribution behaviour under length extension
The spectrum of the transition matrix governs state convergence as input length approaches infinity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectrum scaling of the transition matrix $\mathbf{A}$
Modulating state-space dynamics for length generalization
Selective, post-training spectrum adjustment of pre-trained models
Authors
Peng Lu (Université de Montréal)
Jerry Huang (Université de Montréal, Mila - Quebec AI Institute)
Qiuhao Zeng (Western University)
Xinyu Wang (McGill University)
Boxing Wang (Noah's Ark Lab)
Philippe Langlais (Université de Montréal)
Yufei Cui (McGill University, MILA)