🤖 AI Summary
Balancing expressiveness and computational efficiency remains a central challenge in sequence modeling. Method: This paper proposes Structured Linear Controlled Differential Equations (SLiCEs), a framework for sequence models built on structured, input-dependent state-transition matrices. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet's diagonal-plus-low-rank structure, and introduces two novel variants based on sparsity and the Walsh–Hadamard transform. Unlike the diagonal state-transition matrices of S4 and Mamba, the block-diagonal, sparse, and Walsh–Hadamard variants are proven to match the maximal expressivity of dense matrices while being cheaper to compute and parallelizable in time. Contributions/Results: A single-layer SLiCE solves the A₅ state-tracking task; SLiCEs achieve best-in-class length generalization on regular-language tasks among parallel-in-time models; and they match the state-of-the-art performance of log neural controlled differential equations on six multivariate time-series classification benchmarks while cutting the average time per training step by a factor of twenty.
📝 Abstract
Structured Linear Controlled Differential Equations (SLiCEs) provide a unifying framework for sequence models with structured, input-dependent state-transition matrices that retain the maximal expressivity of dense matrices whilst being cheaper to compute. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet's diagonal-plus-low-rank structure, as well as two novel variants based on sparsity and the Walsh--Hadamard transform. We prove that, unlike the diagonal state-transition matrices of S4 and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh--Hadamard matrices match the maximal expressivity of dense matrices. Empirically, SLiCEs solve the $A_5$ state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the state-of-the-art performance of log neural controlled differential equations on six multivariate time-series classification datasets while cutting the average time per training step by a factor of twenty.
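To make the core idea concrete, here is a minimal sketch of an input-dependent block-diagonal linear recurrence of the kind the block-diagonal SLiCE variant is built on. All names, shapes, and the particular parameterization below are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, b = 8, 4              # hidden size d, split into blocks of size b
n_blocks = d // b
seq_len, in_dim = 16, 3

# Per-block tensor mapping the input u_t to a b-by-b transition block
# (hypothetical parameterization for illustration).
W = rng.normal(scale=0.1, size=(n_blocks, b, b, in_dim))

def step(h, u):
    """One recurrence step h_t = A(u_t) h_{t-1}, with A(u_t) block-diagonal."""
    h_blocks = h.reshape(n_blocks, b)
    out = np.empty_like(h_blocks)
    for k in range(n_blocks):
        # Input-dependent b-by-b block: identity plus a learned linear map of u_t.
        A_k = np.eye(b) + W[k] @ u
        out[k] = A_k @ h_blocks[k]
    return out.reshape(d)

h = np.ones(d)
for u in rng.normal(size=(seq_len, in_dim)):
    h = step(h, u)

print(h.shape)  # (8,)
```

The point of the structure is cost: applying a block-diagonal transition takes O(d·b) work per step rather than O(d²) for a dense matrix, and since each step is a matrix product, the whole sequence can be evaluated with a parallel associative scan; the paper's theoretical contribution is that this restriction (likewise sparsity or Walsh–Hadamard structure) loses none of the maximal expressivity of dense transitions, unlike a purely diagonal restriction.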