🤖 AI Summary
Transformers face prohibitive computational and memory overhead in long-context modeling; existing linear attention methods mitigate this via fixed-size state compression, but at the cost of degraded retrieval and reasoning performance. This paper proposes Sparse State Expansion (SSE), a novel framework comprising: (i) a softmax-based top-k hard classification mechanism for row-wise sparse state updates; (ii) multi-partition state expansion that decouples parameter count from state capacity, enhancing context awareness and feature disentanglement; and (iii) efficient parallel implementations for both pure linear and hybrid architectures. Experiments demonstrate that SSE-H, the 2B-parameter hybrid variant, outperforms comparably sized open-source Transformers across language modeling, context retrieval, and mathematical reasoning. Notably, it achieves 64.7 and 51.3 points on AIME 2024 and 2025, respectively, establishing state-of-the-art performance for compact models.
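The row-wise sparse update in (i) can be sketched as a toy NumPy routine. This is an illustrative reading of the mechanism, not the paper's implementation: the function name, the use of key logits as per-row classification scores, and the renormalization of the kept scores are all assumptions.

```python
import numpy as np

def sparse_state_update(S, key_logits, v, topk=4):
    """Hedged sketch of a row-sparse linear-attention state update.

    S:          (num_rows, d_v) contextual state; each row is treated
                as one "class" of stored information.
    key_logits: (num_rows,) scores classifying the incoming token.
    v:          (d_v,) value vector of the incoming token.

    A softmax over state rows followed by hard top-k selection means
    only `topk` rows receive the rank-1 update, leaving the rest
    untouched (reducing inter-class interference).
    """
    w = np.exp(key_logits - key_logits.max())
    w /= w.sum()                         # softmax over state rows
    idx = np.argsort(w)[-topk:]          # hard top-k selection
    mask = np.zeros_like(w)
    mask[idx] = w[idx] / w[idx].sum()    # renormalize kept rows (assumed)
    return S + np.outer(mask, v)         # sparse rank-1 update
```

A dense linear-attention update would instead add `np.outer(w, v)` with no masking; the top-k mask is what makes the update row-sparse.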
📝 Abstract
The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
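The multi-partition expansion described above can be sketched as follows, assuming each partition receives its own routing logits (a hypothetical stand-in for the paper's learned classification scheme) while the value projection is shared, so parameter count stays fixed as state capacity grows with the number of partitions. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def topk_softmax_mask(logits, topk):
    """Softmax over state rows, then hard top-k with renormalization."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    idx = np.argsort(w)[-topk:]
    mask = np.zeros_like(w)
    mask[idx] = w[idx] / w[idx].sum()
    return mask

def sse_update(partitions, logits_per_partition, v, topk=2):
    """Hedged sketch of Sparse State Expansion (names are assumptions).

    partitions:           list of (num_rows, d_v) state partitions.
    logits_per_partition: one (num_rows,) routing vector per partition.
    v:                    (d_v,) shared value vector; reusing it across
                          partitions is what decouples parameter count
                          from total state capacity.

    Each partition keeps its own sparse top-k classification update.
    """
    return [S + np.outer(topk_softmax_mask(l, topk), v)
            for S, l in zip(partitions, logits_per_partition)]
```

Doubling the number of partitions doubles state capacity (and thus retrievable context) without adding projection parameters, which mirrors the scaling behavior the abstract attributes to SSE.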