SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
The quadratic computational complexity of Transformer self-attention severely hinders efficiency and scalability for long-sequence modeling. Existing linear-time alternatives—such as Mamba or sliding-window attention—improve throughput but compromise modeling of long-range dependencies due to inherent locality or fixed memory constraints. To address this, we propose SCOUT, a novel architecture featuring a segmented compression mechanism: input sequences are partitioned into fixed-length segments; intra-segment features are extracted via local linear mixers (supporting either Mamba or sliding-window variants); and inter-segment global context is dynamically aggregated using sparse historical checkpoint attention, operating at sub-quadratic complexity. This design jointly optimizes expressivity and efficiency. Empirically, SCOUT matches full-attention Transformer performance on 400M- and 1.3B-parameter models while achieving higher end-to-end throughput and significantly outperforming strong baselines on long-sequence tasks.

📝 Abstract
Transformers have demonstrated strong performance across a wide range of sequence modeling tasks, but their quadratic attention complexity limits scalability to long sequences. Linear models such as Mamba and sliding-window attention (SWA) address this by mixing tokens through recurrent or localized operations with fixed-size memory, achieving efficient inference. However, these methods risk degrading performance on long sequences due to their inability to retain detailed information from distant tokens. We propose SCOUT (Segment Compression for Optimized Utility in Transformers), a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. Each token embedding is first enriched via a linear local mixer, Mamba or SWA, that integrates recent context. Then, instead of attending to all previous tokens, each token sparsely attends to a small number of compressed checkpoint tokens that summarize the input history. This design retains much of the expressivity of full attention while substantially reducing the computational and memory cost. By attending to compressed history rather than to all previous tokens, SCOUT incurs slightly higher memory than purely linear models, but its growth rate remains sub-quadratic and far more scalable than that of full Transformers. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks. SCOUT with both Mamba and SWA mixers outperforms strong long-sequence baselines under the same computational budget, and matches full-attention Transformers on language modeling and common-sense reasoning tasks at the 400M and 1.3B scales. Moreover, SCOUT achieves higher end-to-end throughput than SOTA models, while delivering comparable results on long-sequence benchmarks.
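The mechanism the abstract describes — compress each fixed-length segment into a checkpoint token, then let every token attend only to the checkpoints of completed earlier segments — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: mean-pooling stands in for the learned segment compression, the local mixer (Mamba/SWA) step is assumed to have already been applied to the inputs, and the function name and residual combination are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scout_checkpoint_attention(x, seg_len=4):
    """Toy sketch of SCOUT-style sparse checkpoint attention.

    x: (T, d) token embeddings, assumed already enriched by a local
       mixer (Mamba or SWA in the paper).
    Each segment of `seg_len` tokens is compressed to one checkpoint
    (mean-pooled here; the paper learns this compression). Each token
    then attends only to checkpoints of fully completed earlier
    segments, so per-token cost grows with T/seg_len, not T.
    """
    T, d = x.shape
    n_seg = T // seg_len
    # One checkpoint per completed segment: (n_seg, d).
    checkpoints = x[: n_seg * seg_len].reshape(n_seg, seg_len, d).mean(axis=1)
    out = x.copy()
    for t in range(T):
        k = t // seg_len          # number of checkpoints visible to token t
        if k == 0:
            continue              # first segment: no completed history yet
        scores = checkpoints[:k] @ x[t] / np.sqrt(d)
        w = softmax(scores)
        out[t] = x[t] + w @ checkpoints[:k]  # residual add of global context
    return out
```

Because attention is over at most T/seg_len checkpoints rather than all T prior tokens, the total cost scales as O(T^2 / seg_len), which is the sub-quadratic growth the abstract refers to.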
Problem

Research questions and friction points this paper is trying to address.

Reducing the quadratic attention complexity of Transformers on long sequences
Retaining detailed information from distant tokens efficiently
Balancing computational cost against model performance in long contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture that compresses tokens locally within fixed-size segments
Sparse attention over compressed checkpoint tokens summarizing history
Sub-quadratic cost growth while retaining much of full attention's expressivity
Authors

Aref Jafari — Huawei Noah's Ark Lab, University of Waterloo
Yuhe Fan — Huawei Noah's Ark Lab
Benyamin Jamialahmadi — Huawei Noah's Ark Lab
Parsa Farinneya — University of Toronto
Boxing Chen — Huawei Technologies Canada
Marzieh S. Tahaei — Huawei Noah's Ark Lab