Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Linear attention compresses the entire input sequence into a fixed-size recurrent state, so its finite memory forgets earlier tokens and degrades performance on retrieval-intensive tasks. To mitigate this, the paper explores hybrid models that interleave linear attention with sparse-attention token mixers, restoring direct access to past tokens: (1) sparse attention with token eviction, including a novel learnable eviction scheme in which a lightweight, end-to-end trainable CNN aggregates information from adjacent past and future tokens and, combined with sliding-window attention, adaptively retains a limited set of critical KV-pairs per head; (2) query-aware native sparse attention; and (3) efficient Triton kernels for the sparse attention mechanisms. The learnable eviction scheme maintains the constant time and space complexity of linear attention. Evaluations on retrieval-intensive benchmarks support the effectiveness of these approaches.

📝 Abstract

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.

Problem

Research questions and friction points this paper is trying to address.

Addresses forgetfulness in linear attention models during retrieval tasks

Proposes hybrid sparse attention with a learnable token eviction mechanism

Maintains constant time and space complexity while improving long-sequence information retention
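
The forgetfulness named above can be seen from the standard unnormalized linear-attention recurrence. This is a sketch in generic notation (the symbols S_t, q_t, k_t, v_t, d_k, d_v are assumptions, not taken from the paper), and many linear-attention variants add decay or gating terms that are omitted here:

```latex
S_t = S_{t-1} + k_t v_t^{\top}, \qquad
o_t = S_t^{\top} q_t = \sum_{i \le t} \left( k_i^{\top} q_t \right) v_i, \qquad
S_t \in \mathbb{R}^{d_k \times d_v}
```

The state S_t has the same size for every t, so all past tokens are superposed in d_k × d_v numbers. As the sequence grows, individual key-value pairs can no longer be read back exactly, which is what hurts retrieval-intensive tasks.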

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid sparse attention mitigates linear attention forgetfulness

Learnable token eviction adaptively retains critical KV-pairs

A lightweight, end-to-end trainable CNN, combined with sliding-window attention, selects which KV-pairs to retain while keeping time and space complexity constant
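
The eviction idea listed above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the scalar per-token features, the fixed (rather than learned) convolution kernel, and the `window` and `budget` parameters are all hypothetical choices made for illustration. A 1D convolution scores each token from its neighbors on both sides, the most recent `window` tokens are always kept, and at most `budget` older tokens with the highest scores survive, so each query attends to a bounded number of KV-pairs.

```python
import math


def conv1d_scores(features, kernel, bias=0.0):
    """Score every token from a window of adjacent tokens (past and future).

    `features` holds one hypothetical scalar per token; `kernel` is an
    odd-length list of weights. Out-of-range neighbors are treated as zero.
    In a trained model the kernel would be learned; here it is fixed.
    """
    half = len(kernel) // 2
    n = len(features)
    scores = []
    for t in range(n):
        s = bias
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:
                s += w * features[idx]
        scores.append(s)
    return scores


def select_retained(scores, position, window, budget):
    """Indices visible to the query at `position`.

    The last `window` tokens are always kept (sliding window). Of the older
    tokens, only the `budget` highest-scoring ones are retained. Assumption:
    the kernel half-width is at most `window`, so a token's score only uses
    neighbors that already exist when it leaves the window.
    """
    window_start = max(0, position - window + 1)
    recent = list(range(window_start, position + 1))
    older = range(window_start)
    kept_old = sorted(older, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(kept_old) + recent


def sparse_attention(q, keys, values, indices):
    """Softmax attention restricted to the retained indices."""
    scale = math.sqrt(len(q))
    logits = [sum(a * b for a, b in zip(q, keys[i])) / scale for i in indices]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dim_v = len(values[0])
    return [
        sum(w * values[i][c] for w, i in zip(weights, indices)) / z
        for c in range(dim_v)
    ]
```

Because `select_retained` never returns more than `window + budget` indices, the cost of `sparse_attention` per query does not grow with sequence length. That bound is the property the sketch is meant to show; how scores are actually learned and applied per head is left to the paper.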

🔎 Similar Papers

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention