Sequential-Parallel Duality in Prefix Scannable Models

📅 2025-06-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the fundamental trade-off between parallelizable training and efficient sequential inference in sequence modeling. The authors propose Prefix-Scannable Models (PSMs), a formally defined class of neural sequence models that jointly support near-constant-time parallel evaluation and linear-time, constant-space sequential inference. PSMs generalize the classic parallel prefix scan by relaxing the associativity requirement on the state aggregation operator, admitting arbitrary aggregation functions such as softmax attention with O(1) amortized compute per token and O(log N) memory. This framework unifies diverse architectures, including element-wise RNNs such as Mamba and linear transformers such as GLA, Mamba2, and mLSTM, under a single theoretical lens. Experiments on small-scale language modeling and canonical synthetic tasks show that PSMs retain the expressivity of Transformer-based architectures, match state-space models in inference efficiency, and in some cases exhibit better length generalization than either.

📝 Abstract
Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such "sequential-parallel duality." This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.
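To make the prefix-scan view of state space models concrete, here is a minimal sketch (not from the paper; the function names and divide-and-conquer structure are illustrative) of computing a gated linear recurrence h_t = a_t * h_{t-1} + b_t via a custom associative aggregation operator, the pattern the abstract describes:

```python
def combine(x, y):
    # Associative composition of affine maps h -> a*h + b:
    # applying (a1, b1) then (a2, b2) yields (a1*a2, a2*b1 + b2).
    a1, b1 = x
    a2, b2 = y
    return a1 * a2, a2 * b1 + b2

def scan(ops):
    # Inclusive prefix scan with the custom combine. The divide-and-conquer
    # structure mirrors the parallel algorithm: both halves can be scanned
    # independently (in parallel), then the right half is shifted by the
    # left half's total, giving O(log N) parallel depth.
    if len(ops) == 1:
        return list(ops)
    mid = len(ops) // 2
    left, right = scan(ops[:mid]), scan(ops[mid:])
    total = left[-1]
    return left + [combine(total, r) for r in right]

# Gated linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0.
a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 2.0, 0.5, 1.5]
pairs = list(zip(a, b))
scanned = scan(pairs)
hidden = [h for _, h in scanned]  # hidden states of the recurrence
```

Because `combine` is associative, the parallel scan recovers exactly the states that a left-to-right sequential recurrence would produce; PSMs drop that associativity requirement while preserving a scan-like evaluation structure.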
Problem

Research questions and friction points this paper is trying to address.

Characterize neural sequence models with parallel and sequential efficiency
Generalize state space models to include non-associative functions
Evaluate models on language tasks and compare efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

State space models with prefix scan
Generalized Prefix-Scannable Models (PSMs)
Softmax-like operators for efficiency
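As an illustration of the constant-space sequential inference that linear-transformer-style members of this class (e.g., GLA) admit, here is a hypothetical sketch of a scalar-gated linear attention recurrence; the function name and gate parameterization are assumptions for illustration, not the paper's definitions:

```python
import numpy as np

def gla_step(S, q, k, v, g):
    # One sequential inference step of a scalar-gated linear attention
    # recurrence:
    #   S_t = g_t * S_{t-1} + v_t k_t^T   -- constant-size d x d matrix state
    #   o_t = S_t q_t                     -- per-token readout, O(d^2) work
    S = g * S + np.outer(v, k)
    return S, S @ q

rng = np.random.default_rng(0)
d, T = 4, 8
qs, ks, vs = rng.standard_normal((3, T, d))
gs = rng.uniform(0.8, 1.0, size=T)  # illustrative decay gates

S = np.zeros((d, d))  # state size is independent of sequence length
outputs = []
for t in range(T):
    S, o = gla_step(S, qs[t], ks[t], vs[t], gs[t])
    outputs.append(o)
```

The state `S` stays d x d regardless of sequence length, which is the constant-space sequential mode; the same recurrence unrolled over a batch can be evaluated with a parallel scan, since the gated update composes associatively.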