Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the cost of attention computation and KV-cache access in long-context language model inference, where sparse attention methods must trade efficiency against accuracy. The authors investigate whether the exponentially decaying memory of RAT+, a recurrence-augmented attention backbone from prior work, can also improve existing query-aware sparse inference methods, using Quest, MoBA, and SnapKV as representatives. They evaluate both the released RAT+ checkpoints and OLMo2-7B, which they continue pretraining with the added memory module for 10 billion tokens. Across eight needle-in-a-haystack tasks, RAT+ consistently improves accuracy over standard attention at the sparse budgets tested. The authors also propose two hypotheses for why the memory module benefits query-aware sparse inference and design targeted experiments to support them.
📝 Abstract
Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
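The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the RAT+ implementation: the recurrence `m_t = decay * m_{t-1} + (1 - decay) * x_t` is a generic exponentially decaying memory chosen for illustration, and the page selection is a simplified Quest-style criterion that scores each page of keys by an upper bound on the query-key dot product. All function names and the `decay`, `page_size`, and `budget` parameters are assumptions made for this example.

```python
import numpy as np

def decayed_memory(x, decay=0.9):
    """Generic exponentially decaying memory over a sequence (assumed form):
    m_t = decay * m_{t-1} + (1 - decay) * x_t, starting from m_0 = 0."""
    m = np.zeros(x.shape[1])
    out = np.empty_like(x, dtype=float)
    for t, x_t in enumerate(x):
        m = decay * m + (1.0 - decay) * x_t
        out[t] = m
    return out

def select_pages(q, keys, page_size, budget):
    """Quest-style selection (simplified): keep the `budget` pages whose
    per-dimension min/max key bounds give the largest upper bound on q.k."""
    n_pages = keys.shape[0] // page_size
    pages = keys[: n_pages * page_size].reshape(n_pages, page_size, -1)
    k_min, k_max = pages.min(axis=1), pages.max(axis=1)
    # For each dimension, the larger of q*min and q*max bounds q*k on that page.
    upper = np.maximum(q * k_min, q * k_max).sum(axis=1)
    return np.sort(np.argsort(-upper)[:budget])

def sparse_attention(q, keys, values, page_size, budget):
    """Softmax attention restricted to the tokens of the selected pages."""
    page_idx = select_pages(q, keys, page_size, budget)
    token_idx = (page_idx[:, None] * page_size + np.arange(page_size)).ravel()
    k, v = keys[token_idx], values[token_idx]
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v
```

In this sketch, one way to combine the two pieces would be to pass `decayed_memory(keys)` and `decayed_memory(values)` to `sparse_attention` in place of the raw keys and values, so that each retained token also carries a decayed summary of its predecessors. Whether RAT+ applies its memory in this way is not stated in the abstract.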
Problem

Research questions and friction points this paper is trying to address.

efficient inference
query-aware sparse attention
long-context language models
KV-cache sparsity
attention accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

exponentially decaying memory
query-aware sparse attention
RAT+
KV sparsity
long-context inference