SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing block-sparse attention methods, which suffer from cumulative information loss over long contexts due to imprecise block selection and the outright discarding of unselected blocks. To mitigate this, the authors propose the SPLA framework, which integrates a high-fidelity block selection strategy based on second-order Taylor expansion with a Residual Linear Attention (RLA) mechanism. Instead of discarding unselected blocks, SPLA implicitly compresses their contributions and enables efficient computation through a subtraction-based formulation that avoids explicit access to these blocks. Evaluated on long-context benchmarks such as RULER, SPLA outperforms dense attention models, substantially narrows the performance gap in continual pretraining, and preserves general knowledge and reasoning capabilities—effectively balancing computational efficiency, accuracy, and contextual completeness.

📝 Abstract
Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
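The subtraction-based RLA formulation described above can be sketched in a few lines of NumPy. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the elu+1 feature map, and the single-block-granularity mask are placeholders. The key point the sketch demonstrates is that the residual over unselected blocks equals global linear attention minus linear attention over the selected blocks, so the unselected blocks are never touched individually.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a common positive feature map for linear attention;
    # the paper's actual kernel may differ.
    return np.where(x > 0, x + 1.0, np.exp(x))

def spla_block(q, k, v, selected_mask):
    """Hypothetical SPLA sketch.
    q: (n_q, d); k, v: (n_k, d); selected_mask: (n_k,) bool, True = exact attention."""
    k_sel, v_sel = k[selected_mask], v[selected_mask]

    # Exact softmax attention restricted to the selected blocks.
    scores = q @ k_sel.T / np.sqrt(q.shape[-1])
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    exact = p @ v_sel

    # Residual linear attention via subtraction: (global state) - (selected state).
    # Unselected keys/values only ever appear inside the global sums.
    phi_q, phi_k, phi_k_sel = feature_map(q), feature_map(k), feature_map(k_sel)
    kv_res = phi_k.T @ v - phi_k_sel.T @ v_sel        # (d, d_v) residual state
    z_res = phi_k.sum(0) - phi_k_sel.sum(0)           # (d,) residual normalizer
    residual = (phi_q @ kv_res) / np.maximum(phi_q @ z_res, 1e-6)[:, None]

    # Combine the exact sparse output with the compressed residual contribution.
    return exact + residual
```

One consequence worth noting: when every block is selected, the residual state is exactly zero and the sketch reduces to plain softmax attention over the full context, which is the expected limiting behavior.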
Problem

Research questions and friction points this paper is trying to address.

block sparse attention
long-context modeling
contextual loss
selection fidelity
attention efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block Sparse Attention
Linear Attention
Taylor Expansion
Residual Compression
Long-context Modeling
Bailin Wang
MIT CSAIL
natural language processing, machine learning
Dan Friedman
Apple, California, USA
Tao Lei
Apple, California, USA
Chong Wang
Apple
machine learning