🤖 AI Summary
The quadratic complexity of Transformer self-attention limits efficiency and scalability for long-sequence modeling. Existing linear-time alternatives such as Mamba and sliding-window attention improve throughput but weaken long-range dependency modeling because of their inherent locality and fixed-size memory. To address this, we propose SCOUT, an architecture built around a segmented compression mechanism: the input sequence is partitioned into fixed-length segments; within each segment, token representations are enriched by a local linear mixer (either a Mamba or a sliding-window variant); and global context across segments is aggregated by sparse attention over compressed historical checkpoint tokens, at sub-quadratic cost. This design balances expressivity and efficiency. Empirically, SCOUT matches full-attention Transformers at the 400M and 1.3B parameter scales, outperforms strong long-sequence baselines under the same computational budget, and achieves higher end-to-end throughput.
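For a rough sense of the savings, the back-of-the-envelope comparison below uses illustrative values for the sequence length `T` and segment length `S`; these numbers are assumptions for illustration, not figures from the paper.

```python
# Back-of-the-envelope attention-cost comparison (illustrative values, not from the paper).
T = 32_768   # sequence length (assumed)
S = 128      # segment length (assumed)

# Full causal self-attention: every token scores every other token, ~T^2 pairs per layer.
full_attention_pairs = T * T

# Checkpoint attention: each token scores roughly T / S compressed checkpoints instead.
checkpoint_pairs = T * (T // S)

print(f"full attention       : {full_attention_pairs:.2e} query-key pairs")
print(f"checkpoint attention : {checkpoint_pairs:.2e} query-checkpoint pairs")
print(f"reduction            : {full_attention_pairs / checkpoint_pairs:.0f}x")  # = S = 128x
```

This only illustrates the factor-S reduction from attending to checkpoints rather than all previous tokens; the paper's own analysis is what supports the sub-quadratic growth claim.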
📝 Abstract
Transformers have demonstrated strong performance across a wide range of sequence modeling tasks, but their quadratic attention complexity limits scalability to long sequences. Linear models such as Mamba and sliding-window attention (SWA) address this by mixing tokens through recurrent or localized operations with fixed-size memory, achieving efficient inference. However, these methods risk degrading performance on long sequences because they cannot retain detailed information from distant tokens. We propose SCOUT (Segment Compression for Optimized Utility in Transformers), a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. Each token embedding is first enriched by a linear local mixer (Mamba or SWA) that integrates recent context. Then, instead of attending to all previous tokens, each token sparsely attends to a small number of compressed checkpoint tokens that summarize the input history. This design retains much of the expressivity of full attention while substantially reducing computational and memory cost. By attending to compressed history rather than all previous tokens, SCOUT incurs slightly higher memory cost than purely linear models, but its growth remains sub-quadratic and far more scalable than that of full Transformers. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks. With both Mamba and SWA mixers, SCOUT outperforms strong long-sequence baselines under the same computational budget and matches full-attention Transformers on language modeling and common-sense reasoning tasks at the 400M and 1.3B scales. Moreover, SCOUT achieves higher end-to-end throughput than SOTA models while delivering comparable results on long-sequence benchmarks.
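To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of a SCOUT-style block. It is not the authors' implementation: the causal depthwise convolution standing in for the Mamba/SWA local mixer, the mean-pooled checkpoint per segment, the learned always-visible sink checkpoint, and the segment-level causal mask are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class SCOUTBlockSketch(nn.Module):
    """Illustrative SCOUT-style block: local mixing + sparse checkpoint attention.

    Hypothetical simplifications (not from the paper): the local mixer is a
    causal depthwise convolution, each segment is summarized by mean-pooling
    its mixed token states, and a learned "sink" checkpoint keeps every
    attention row non-empty.
    """

    def __init__(self, d_model: int, segment_len: int, n_heads: int = 4, kernel: int = 4):
        super().__init__()
        self.segment_len = segment_len
        # Stand-in for the Mamba / sliding-window local mixer: causal depthwise conv.
        self.local_mixer = nn.Conv1d(d_model, d_model, kernel, groups=d_model, padding=kernel - 1)
        # Sparse attention: tokens are queries, compressed checkpoints are keys/values.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sink = nn.Parameter(torch.zeros(1, 1, d_model))  # always-visible checkpoint
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); this sketch assumes seq_len % segment_len == 0.
        b, t, d = x.shape
        s = self.segment_len
        assert t % s == 0, "sketch assumes the sequence length is a multiple of segment_len"
        n_seg = t // s

        # 1) Local mixing: each token integrates recent context (cost linear in seq_len).
        mixed = self.local_mixer(x.transpose(1, 2))[..., :t].transpose(1, 2)

        # 2) Compression: one checkpoint token per segment via mean pooling.
        checkpoints = mixed.reshape(b, n_seg, s, d).mean(dim=2)          # (b, n_seg, d)
        kv = torch.cat([self.sink.expand(b, 1, d), checkpoints], dim=1)  # (b, n_seg + 1, d)

        # 3) Segment-level causal mask: a token in segment i attends only to the sink
        #    and to checkpoints of segments 0..i-1 (True = blocked).
        token_seg = torch.arange(t, device=x.device) // s                # (t,)
        ckpt_seg = torch.arange(-1, n_seg, device=x.device)              # sink has index -1
        mask = ckpt_seg[None, :] >= token_seg[:, None]                   # (t, n_seg + 1)

        ctx, _ = self.attn(mixed, kv, kv, attn_mask=mask)
        return self.norm(x + mixed + ctx)


if __name__ == "__main__":
    block = SCOUTBlockSketch(d_model=64, segment_len=16)
    out = block(torch.randn(2, 128, 64))
    print(out.shape)  # torch.Size([2, 128, 64])
```

In this sketch the checkpoint keys/values grow with the number of segments rather than the number of tokens, which is the source of the reduced attention cost; the choice of compression (mean pooling here) and of the local mixer are the places where the actual SCOUT design would differ.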