Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the cost of attention computation and KV-cache access in long-context language model inference, where sparse attention methods trade accuracy for efficiency. Building on RAT+, a recurrence-augmented attention backbone with an exponentially decaying memory, the authors test whether that memory also helps existing query-aware sparse inference methods, using Quest, MoBA, and SnapKV as representatives. They evaluate both the released RAT+ checkpoints and an OLMo2-7B model that they continue pretraining for 10 billion tokens with the added memory module. Across eight needle-in-a-haystack tasks, RAT+ consistently improves accuracy over standard attention at a range of sparse budgets. The authors also propose two hypotheses for why the memory module benefits query-aware sparse inference and design targeted experiments to support them.

📝 Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. The recent work RAT+ introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
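To make the two ingredients named in the abstract concrete, here is a minimal sketch in plain Python. It is not the RAT+ implementation: the recurrence form `m_t = decay * m_(t-1) + (1 - decay) * x_t`, the scalar `decay`, and the function names `decayed_memory` and `select_blocks` are all assumptions chosen for illustration. The block scoring follows the general idea of query-aware selection with per-block min/max key bounds, in the spirit of Quest, and may differ from the exact procedure each method uses.

```python
from typing import List

Vector = List[float]


def decayed_memory(values: List[Vector], decay: float) -> List[Vector]:
    """Running memory m_t = decay * m_(t-1) + (1 - decay) * x_t, with m_0 = 0.

    Assumed form for illustration only; the paper's recurrence may differ.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    dim = len(values[0]) if values else 0
    memory = [0.0] * dim
    states = []
    for x in values:
        # Older tokens are down-weighted by one more factor of `decay` per step.
        memory = [decay * m + (1.0 - decay) * xi for m, xi in zip(memory, x)]
        states.append(memory)
    return states


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, budget: int) -> List[int]:
    """Return indices of the `budget` blocks with the highest score bound.

    Each block is scored by an upper bound on q . k over its keys:
    sum over dimensions of max(q_d * min_d, q_d * max_d).
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        score = 0.0
        for d, q in enumerate(query):
            lo = min(k[d] for k in block)
            hi = max(k[d] for k in block)
            score += max(q * lo, q * hi)
        scores.append(score)
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:budget])
```

In this sketch, one way to combine the two pieces would be to pass the keys through `decayed_memory` before calling `select_blocks`, so that each key carries a decayed summary of earlier tokens; whether that matches how RAT+ interacts with Quest, MoBA, or SnapKV is not stated in the abstract.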

Problem

Research questions and friction points this paper is trying to address.

efficient inference

query-aware sparse attention

long-context language models

KV-cache sparsity

attention accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

exponentially decaying memory

query-aware sparse attention

RAT+