🤖 AI Summary
This paper investigates Mamba’s fundamental capability to model Markov chains via in-context learning (ICL). Addressing the lack of theoretical understanding of Mamba’s learning mechanism, the authors prove that even a single-layer Mamba can represent the in-context Laplacian smoothing estimator, which is both Bayes- and minimax-optimal, for first-order and arbitrary higher-order Markov processes. Crucially, this capability arises from Mamba’s convolutional structure rather than attention. The analysis combines a theoretical characterization of structured and selective state-space models (SSMs), a representation-capacity argument, and empirical validation, yielding the first formal connection between Mamba and optimal statistical estimators. Empirically, a single-layer Mamba matches or outperforms Transformers on Markovian ICL tasks while attaining the statistical optimality of the Laplacian smoothing estimator.
📝 Abstract
While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speedups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
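For readers unfamiliar with the estimator the abstract refers to, the following is a minimal sketch of add-β (Laplacian) smoothing for next-symbol prediction in an order-k Markov chain. The function name, the smoothing parameter `beta`, and the binary example sequence are illustrative assumptions, not taken from the paper; the formula shown is the standard add-β rule, which is Bayes-optimal under a Dirichlet prior on transition probabilities.

```python
from collections import defaultdict

def laplace_smoothed_estimate(seq, order=1, num_states=2, beta=1.0):
    """Add-beta (Laplacian-smoothed) next-symbol probabilities for an
    order-`order` Markov chain, estimated from counts in `seq`:

        P(next = s | context) = (count(context -> s) + beta)
                                / (count(context) + beta * num_states)

    Illustrative sketch; names and beta=1.0 default are assumptions.
    """
    # Count transitions context -> next symbol over the observed sequence.
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(order, len(seq)):
        ctx = tuple(seq[t - order:t])
        counts[ctx][seq[t]] += 1

    # Predict the symbol following the final context in the sequence.
    ctx = tuple(seq[-order:])
    total = sum(counts[ctx].values())
    return [
        (counts[ctx][s] + beta) / (total + beta * num_states)
        for s in range(num_states)
    ]

# Example: first-order chain over states {0, 1}.
probs = laplace_smoothed_estimate([0, 1, 1, 0, 1, 1, 0, 1], order=1)
```

An unseen context falls back to the uniform distribution (all counts zero), which is the smoothing behavior that makes this estimator well-defined in the in-context setting, where early contexts may never have been observed.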