🤖 AI Summary
Linear recurrent networks with diagonal state transition structures suffer from limited expressivity, hindering their ability to model long-range dependencies efficiently. To address this, the work proposes the Higher-order Linear Recurrent Unit (H-LRU) and the Block-Diagonal Linear Recurrent Unit (BD-LRU), which enhance state interaction through multi-step historical state fusion and dense intra-block channel mixing, respectively. A selective gating mechanism with L1 normalization is introduced to stabilize training, and parallel scan algorithms enable efficient computation. Experimental results show that BD-LRU matches or surpasses Mamba, DeltaNet, and LSTM on synthetic tasks, while H-LRU achieves the best parameter efficiency on compression tasks, confirming the critical role of state mixing mechanisms in enhancing model expressivity.
📝 Abstract
Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive, but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1 normalization of selective gates stabilizes training and allows window/block sizes to be scaled. A parallel-scan implementation of the proposed architectures keeps throughput competitive with diagonal LRNNs for moderate orders (H-LRU) and block sizes (BD-LRU). On synthetic sequence modeling tasks, BD-LRU matches or exceeds linear SSMs (Mamba), low-rank LRNNs (DeltaNet), and LSTM baselines, while H-LRU proves the most parameter-efficient on the compression task. Across both synthetic sequence modeling and language modeling, our results indicate that the structure of state mixing, rather than width alone, shapes the expressivity of LRNNs, offering a practical route to closing the efficiency-expressivity gap in linear sequence models.
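To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a higher-order recurrence with L1-normalized gates: each new hidden state is a gated mixture of the k most recent states plus the current input, and the per-channel gates are L1-normalized over the k history slots so their magnitudes sum to one, which keeps the recurrence contractive. All function names, the random stand-in gates, and the order k=3 are illustrative assumptions.

```python
import numpy as np

def l1_normalized_gates(raw_gates):
    # L1-normalize gate magnitudes over the history axis so the mixing
    # weights for each channel sum to 1 across the k past states
    # (illustrative stand-in for the paper's selective-gate normalization).
    mag = np.abs(raw_gates)
    return mag / (mag.sum(axis=0, keepdims=True) + 1e-8)

def h_lru_step(past_states, gates, x_t):
    # past_states: (k, d) buffer of the k most recent hidden states
    # gates:       (k, d) per-channel mixing weights, L1-normalized over k
    # x_t:         (d,)   current input projection
    # One step of an order-k linear recurrence: a convex-like combination
    # of past states plus the input.
    return (gates * past_states).sum(axis=0) + x_t

def run_h_lru(x, k=3, seed=0):
    # Sequential reference loop (the paper uses a parallel scan instead).
    # Gates here are random stand-ins for input-dependent selective gates.
    rng = np.random.default_rng(seed)
    T, d = x.shape
    states = np.zeros((k, d))
    outs = []
    for t in range(T):
        gates = l1_normalized_gates(rng.standard_normal((k, d)))
        h = h_lru_step(states, gates, x[t])
        states = np.roll(states, 1, axis=0)  # shift history window
        states[0] = h
        outs.append(h)
    return np.stack(outs)
```

Setting k=1 recovers an ordinary diagonal first-order LRNN step; the sequential loop is only a readability reference, since the linearity of the update is what permits the parallel-scan formulation mentioned above.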