Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

📅 2026-05-18

🤖 AI Summary

This work addresses the inherent trade-off in state-space models (SSMs) between expressivity and computational efficiency: unstructured transition matrices are highly expressive but expensive in compute and memory, whereas structured matrices are efficient yet limited in expressivity. To reconcile this tension, the authors propose Flash PD-SSM, which maintains a trainable set of structured sparse matrices and discretely selects a single one at each time step. This design keeps training efficient at scale while reaching finite-state automaton (FSA) expressivity at the level of unstructured SSMs. Empirically, the model realizes its theoretical expressivity on synthetic mechanistic and state-tracking tasks and sets a new state-of-the-art accuracy among competing SSM methods on multivariate time-series tasks with sequences longer than 17,000 steps. As a drop-in replacement in hybrid LLMs, it improves natural-language state tracking and common language modeling, with higher throughput and lower memory consumption than SSMs widely used in frontier language models.

📝 Abstract

State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
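The selection mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the factorization of each transition matrix as a column one-hot matrix P times a diagonal D is an assumption suggested by the "PD" in the model name, and the names apply_pd, run_selected_ssm, rows_bank, and diag_bank are hypothetical. The input term and the learned selector are omitted; the input symbol itself serves as the selection index. The example tracks the parity automaton, a two-state FSA, with a bank of two matrices, and each step costs O(N) rather than the O(N^2) of a dense matrix-vector product.

```python
import numpy as np

def apply_pd(rows, diag, x):
    """Compute (P @ D) @ x in O(N) without building the N x N matrix.

    Assumed structure (hypothetical, not confirmed by the source):
    D = diag(diag), and P is column one-hot with P[rows[j], j] = 1.
    """
    out = np.zeros_like(x)
    # Scatter-add: out[rows[j]] += diag[j] * x[j], safe for repeated rows.
    np.add.at(out, rows, diag * x)
    return out

def run_selected_ssm(rows_bank, diag_bank, selections, x0):
    """Apply one discretely selected matrix from the bank per time step."""
    x = x0
    for k in selections:
        x = apply_pd(rows_bank[k], diag_bank[k], x)
    return x

# Bank of two matrices for the parity automaton:
# index 0 is the identity (state unchanged), index 1 swaps the two states.
rows_bank = np.array([[0, 1], [1, 0]])
diag_bank = np.ones((2, 2))

# Stand-in for a learned selector: the input symbol is the selection index.
symbols = [1, 0, 1, 1]
x0 = np.array([1.0, 0.0])  # one-hot encoding of the "even" state

final_state = run_selected_ssm(rows_bank, diag_bank, symbols, x0)
print(final_state)  # [0. 1.]  (three 1s seen, so the state is "odd")
```

In a trained model the selection index would come from an input-dependent selector and the bank entries would be learned; the sketch only shows why a single selected sparse matrix per step is enough to encode an automaton transition.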

Problem

Research questions and friction points this paper is trying to address.

state-space models

expressivity

efficiency

structured sparse matrices

memory optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured sparse

state-space models

finite-state automaton expressivity