🤖 AI Summary
This work investigates the expressiveness and length generalization of selective state-space models (SSMs) on regular language tasks, i.e., emulating finite-state automata (FSAs), and addresses limitations of existing SSM-based architectures on sequence lengths unseen during training. It proposes the Selective Dense SSM (SD-SSM), the first selective SSM to achieve perfect length generalization on a set of regular language tasks using a single layer. SD-SSM combines a dictionary of dense transition matrices, a softmax selection mechanism that forms a convex combination of the dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. The work also evaluates variants of diagonal selective SSMs on commutative and non-commutative automata and explains the empirical results with theoretical considerations.
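To make the commutative versus non-commutative distinction concrete, here is a minimal sketch in plain Python. The two automata (parity and a set/reset flip-flop), the helper names `run_fsa` and `is_commutative`, and the dictionary encoding of the transition function are illustrative assumptions and are not taken from the paper. An automaton is treated as commutative here when applying any two input symbols in either order leads to the same state.

```python
from itertools import product


def run_fsa(delta, start, word):
    """Emulate an FSA: apply the transition function symbol by symbol."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state


def is_commutative(delta, states, alphabet):
    """True if, from every state, any two symbols can be applied in either order."""
    return all(
        delta[(delta[(q, a)], b)] == delta[(delta[(q, b)], a)]
        for q, a, b in product(states, alphabet, alphabet)
    )


# Parity (hypothetical example): the state is the XOR of all bits seen so far.
parity = {(q, s): q ^ s for q in (0, 1) for s in (0, 1)}

# Flip-flop (hypothetical example): "a" resets the state to 0, "b" sets it to 1.
flip_flop = {(q, s): (0 if s == "a" else 1) for q in (0, 1) for s in ("a", "b")}

print(is_commutative(parity, (0, 1), (0, 1)))         # order of bits is irrelevant
print(is_commutative(flip_flop, (0, 1), ("a", "b")))  # the last symbol decides
```

In the parity automaton only the count of ones matters, whereas in the flip-flop the words "ab" and "ba" end in different states, so the order of the inputs carries information.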
📝 Abstract
Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a variety of regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata, and we explain the experimental results with theoretical considerations. Our code is available at https://github.com/IBM/selective-dense-state-space-model.
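The three components named in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the selection logits are computed or how the input enters the state, so the linear selection map `W_sel`, the input projection `B`, the additive update `h = A_t @ h + B @ x`, the parameter-free layer normalization, and all shapes and names below are hypothetical. See the linked repository for the actual model.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def layer_norm(h, eps=1e-5):
    """Layer normalization without learned scale/shift (an assumption)."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)


def sd_ssm_forward(xs, A_dict, W_sel, B, W_out):
    """Sequential pass of a single hypothetical SD-SSM-style layer.

    xs:     (T, d_in)  input sequence
    A_dict: (K, n, n)  dictionary of dense transition matrices
    W_sel:  (K, d_in)  assumed linear map producing selection logits
    B:      (n, d_in)  assumed input projection
    W_out:  (d_out, n) linear readout applied after layer normalization
    """
    h = np.zeros(A_dict.shape[1])
    outputs = []
    for x in xs:
        weights = softmax(W_sel @ x)                   # non-negative, sums to 1
        A_t = np.einsum("k,kij->ij", weights, A_dict)  # convex combination
        h = A_t @ h + B @ x                            # assumed state update
        outputs.append(W_out @ layer_norm(h))          # layer norm, then linear map
    return np.stack(outputs)


# Usage with random, untrained parameters (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
T, d_in, K, n, d_out = 5, 3, 4, 6, 2
out = sd_ssm_forward(
    rng.normal(size=(T, d_in)),
    rng.normal(size=(K, n, n)),
    rng.normal(size=(K, d_in)),
    rng.normal(size=(n, d_in)),
    rng.normal(size=(d_out, n)),
)
print(out.shape)
```

Because the softmax weights are non-negative and sum to one, the transition matrix used at each time step is a convex combination of the dictionary entries, which is the selection mechanism the abstract describes. The dense matrices generally do not commute, in contrast to diagonal transition matrices, which always commute with one another.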