🤖 AI Summary
State space models (SSMs) face a theoretical limit in long-context modeling: they lack the expressiveness to solve multi-query joint recall in sub-quadratic time. To address this, we propose Context-Dependent Sparse Attention (CDSA), a sparse attention mechanism expressive enough to solve multi-query joint recall with sub-quadratic computational complexity. Building on CDSA, we design HAX, a hybrid architecture that integrates SSMs with context-dependent sparse attention realized through locality-sensitive hashing and sparse key selection, tailored to long-sequence modeling in NLP. Extensive evaluations on synthetic and real-world long-context benchmarks show that HAX consistently outperforms both pure SSM baselines and SSMs combined with context-independent sparse attention (CISA), validating its expressive capacity and practical effectiveness in long-context understanding.
📝 Abstract
Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend associative recall to a novel synthetic task, *joint recall*, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
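To make the joint recall task concrete, here is a minimal sketch of how an instance could be generated. The abstract does not specify the paper's tokenization or sampling scheme, so every name and parameter below is illustrative; the point is only that the same key can bind to different values in different contexts, which single-key associative recall never tests.

```python
import random

def make_joint_recall_instance(num_contexts=4, keys_per_context=8,
                               vocab_size=64, num_queries=4, seed=0):
    """Build one synthetic joint-recall example (illustrative, not the
    paper's exact format).

    The sequence is a stream of (context, key, value) triples; the same
    key may map to different values in different contexts, so answering
    a query requires joining the key with its context, not just matching
    the key alone as in plain associative recall.
    """
    rng = random.Random(seed)
    sequence, bindings = [], {}
    for c in range(num_contexts):
        for _ in range(keys_per_context):
            k = rng.randrange(vocab_size)
            v = rng.randrange(vocab_size)
            bindings[(c, k)] = v  # later bindings overwrite earlier ones
            sequence.extend([("CTX", c), ("KEY", k), ("VAL", v)])
    # Multi-query joint recall: several (context, key) probes per sequence.
    probes = rng.sample(sorted(bindings), k=num_queries)
    queries = [((c, k), bindings[(c, k)]) for (c, k) in probes]
    return sequence, queries

seq, queries = make_joint_recall_instance()
(ctx, key), answer = queries[0]
print(f"query: context={ctx}, key={key} -> expected value {answer}")
```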
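The abstract names locality-sensitive hashing and sparse key selection as HAX's ingredients but does not spell out the mechanism, so the following is a generic sketch of context-dependent sparse attention via random-hyperplane LSH, not the paper's algorithm; all function names and parameters are assumptions. Each query attends only to keys that hash into its own bucket, so the attended key set depends on content rather than on a fixed, context-independent pattern.

```python
import numpy as np

def lsh_bucket_ids(x, planes):
    """Sign-of-projection (random hyperplane) LSH: nearby vectors tend
    to land in the same bucket, so bucket membership is content-dependent."""
    bits = (x @ planes > 0).astype(np.int64)           # (n, n_bits)
    return bits @ (1 << np.arange(planes.shape[1]))    # pack bits -> bucket id

def lsh_sparse_attention(q, k, v, n_bits=4, seed=0):
    """Each query attends only to causally visible keys in its own LSH
    bucket; with balanced buckets this needs roughly n^2 / 2^n_bits score
    computations instead of the full n^2 of dense attention."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((q.shape[-1], n_bits))
    qb, kb = lsh_bucket_ids(q, planes), lsh_bucket_ids(k, planes)
    out = np.zeros_like(v)
    for i in range(q.shape[0]):
        sel = np.where(kb[:i + 1] == qb[i])[0]         # context-dependent key set
        if sel.size == 0:
            sel = np.array([i])                        # fall back to self-attention
        scores = q[i] @ k[sel].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ v[sel]
    return out

n, d = 128, 16
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(lsh_sparse_attention(q, k, v).shape)  # (128, 16)
```

The per-query Python loop is for clarity only; a practical implementation would sort tokens by bucket and attend within blocks to realize the sub-quadratic cost in wall time.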