🤖 AI Summary
Mamba exhibits significant performance degradation on sequences exceeding its pretraining context length, revealing sensitivity to context extension. This work establishes, for the first time, a theoretical connection between state convergence behavior and the spectral properties of the state transition matrix $\mathbf{A}$, identifying out-of-distribution degradation as stemming from unbounded growth of $\mathbf{A}$'s spectral radius with input length. To address this, we propose Spectral Scaling: a post-training mechanism that selectively scales the eigenvalues of each layer's $\mathbf{A}$ matrix to stabilize state evolution while preserving pretrained knowledge, requiring no fine-tuning or reparameterization. Evaluated across multiple long-sequence tasks, the method enables robust generalization up to 2×–4× beyond training length, substantially outperforming baselines such as $\Delta_t$ modulation. Our approach provides an interpretable, efficient, and plug-and-play solution for context extension in state space models.
📝 Abstract
The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance deteriorates significantly when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behavior of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works that attribute this sensitivity to the vanishing of the accumulated discretization term $\exp(-\sum_{t=1}^N \Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. To overcome this challenge, we then propose an approach that applies spectrum scaling to pre-trained Mamba models, enabling robust long-context generalization by selectively modulating the spectrum of the $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
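To make the mechanism concrete, here is a minimal toy sketch (not the paper's implementation) of how scaling the eigenvalues of a diagonal state transition matrix changes long-horizon state behavior. The diagonal values, the step size `delta`, the scaling factor `gamma`, and the helper `run_ssm` are all illustrative assumptions, not taken from the paper or a pretrained Mamba model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal state-transition matrix: Mamba-style SSMs use a diagonal A
# with negative real parts. These eigenvalues are illustrative only.
A = -rng.uniform(0.001, 0.1, size=8)  # eigenvalues of A (continuous-time)
delta = 0.05                          # fixed discretization step (assumption)

def run_ssm(A_diag, steps):
    """Unroll h_t = exp(delta * A) * h_{t-1} + x_t with unit inputs
    and return the final state norm (hypothetical helper)."""
    Abar = np.exp(delta * A_diag)     # discretized eigenvalues, each in (0, 1)
    h = np.zeros_like(A_diag)
    for _ in range(steps):
        h = Abar * h + 1.0
    return np.linalg.norm(h)

# Post-hoc spectral scaling: multiplying the eigenvalues of A by gamma > 1
# makes them more negative, shrinking the discretized spectral radius so the
# state converges to a smaller, more stable fixed point over long inputs.
gamma = 4.0
norm_original = run_ssm(A, 10_000)          # steady-state norm, original A
norm_scaled = run_ssm(gamma * A, 10_000)    # smaller norm after scaling
```

In this toy setting the fixed point of each state channel is $1/(1-\bar{a})$ for discretized eigenvalue $\bar{a}$, so pushing $\bar{a}$ away from 1 bounds the state more tightly; the paper's method additionally selects which layers and eigenvalues to scale so that pretrained knowledge is preserved.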