AI Summary
Existing generalization analyses of selective state space models (SSMs) for sequence modeling depend critically on sequence length, limiting both theoretical understanding and practical applicability.
Method: We derive the first sequence-length-independent upper bound on the generalization error of selective SSMs, grounded in covering-number theory and state-space analysis. Our analysis explicitly characterizes how state-matrix stability and input-dependent discretization govern SSM generalization performance.
Contribution/Results: We establish a rigorous equivalence framework between SSMs and linear attention, revealing their fundamental connection to self-attention. The derived bound is both theoretically tight and empirically meaningful: experiments on long-sequence classification and modeling tasks show that it is tighter than prior bounds and confirm the critical roles of stability and discretization. This work provides the first length-agnostic theoretical foundation for SSM generalization and unifies perspectives across state-space models and attention-based architectures.
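The SSM/linear-attention equivalence mentioned above can be illustrated with a minimal numerical sketch (not the paper's construction; all variable names here are illustrative): a scalar linear recurrence with input-dependent coefficients can be unrolled into a lower-triangular, attention-like matrix acting on the input sequence, and the two views produce identical outputs.

```python
import numpy as np

# Scalar linear recurrence computed two ways: as a scan, and as a
# masked attention-like matrix multiply. Coefficients are random
# stand-ins for the input-dependent quantities of a selective SSM.
rng = np.random.default_rng(0)
T = 8
a = rng.uniform(0.1, 0.9, T)   # decay factors (role of the discretized state matrix)
b = rng.standard_normal(T)     # input injections (role of B_bar_t * x_t)
c = rng.standard_normal(T)     # readouts (role of C_t)

# Recurrent form: h_t = a_t h_{t-1} + b_t, y_t = c_t h_t
h, y_scan = 0.0, []
for t in range(T):
    h = a[t] * h + b[t]
    y_scan.append(c[t] * h)
y_scan = np.array(y_scan)

# "Attention" form: y = M @ b with M[t, s] = c_t * prod(a[s+1..t]) for s <= t,
# a causal (lower-triangular) score matrix, as in linear attention.
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1 : t + 1])  # empty product = 1 when s == t
y_attn = M @ b

assert np.allclose(y_scan, y_attn)  # both views coincide
```

The lower-triangular matrix `M` plays the role of a causal attention map whose entries decay with distance, which is the structural link exploited when relating SSMs to self-attention.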
Abstract
State-space models (SSMs) are a new class of foundation models that have emerged as a compelling alternative to Transformers and their attention mechanisms for sequence processing tasks. This paper provides a detailed theoretical analysis of selective SSMs, the core components of the Mamba and Mamba-2 architectures. We leverage the connection between selective SSMs and the self-attention mechanism to highlight the fundamental similarities between these models. Building on this connection, we establish a length-independent, covering-number-based generalization bound for selective SSMs, providing a deeper understanding of their theoretical performance guarantees. We analyze the effects of state-matrix stability and input-dependent discretization, shedding light on the critical role these factors play in the generalization capabilities of selective SSMs. Finally, we empirically demonstrate the sequence-length independence of the derived bounds on two tasks.
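To make the two ingredients the abstract highlights concrete, here is a minimal single-channel sketch of a selective (S6-style) SSM recurrence. It is a simplification under stated assumptions, not the paper's model: parameter names (`w_b`, `w_c`, `w_d`) are hypothetical, the input is scalar per step, and the input matrix is discretized with a simple Euler approximation. A diagonal state matrix `A` with negative entries keeps the discretized decay in (0, 1), illustrating the stability condition; the step size `Delta_t` is computed from the input, illustrating input-dependent discretization.

```python
import numpy as np

def selective_ssm(x, A, w_b, w_c, w_d):
    """Minimal one-channel selective SSM scan (illustrative names).

    x: (T,) scalar input sequence
    A: (n,) diagonal state matrix; entries < 0 give a stable recurrence
    w_b, w_c: (n,) projections producing input-dependent B_t, C_t
    w_d: scalar projection producing the input-dependent step size Delta_t
    """
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(w_d * x_t))  # softplus keeps Delta_t > 0
        a_bar = np.exp(delta * A)            # ZOH discretization; in (0,1) iff A < 0
        b_t = x_t * w_b                      # selectivity: B depends on the input
        c_t = x_t * w_c                      # selectivity: C depends on the input
        h = a_bar * h + delta * b_t * x_t    # state update (Euler approx of B_bar)
        ys.append(float(c_t @ h))            # readout
    return np.array(ys)
```

A quick usage example: `selective_ssm(np.random.default_rng(0).standard_normal(16), -np.ones(4), np.ones(4), np.ones(4), 0.5)` returns a length-16 output sequence. Because `a_bar < 1` whenever `A < 0`, the state's dependence on distant inputs decays geometrically, which is the mechanism behind the stability analysis the abstract refers to.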