MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial computational and memory overhead imposed by large key-value (KV) caches in long-context language modeling. To mitigate this bottleneck, the authors propose Memory-Keyed Attention (MKA), an attention mechanism that hierarchically integrates local, session-level, and long-term KV caches and employs a dynamic routing strategy to allocate attention across them efficiently. Building on MKA, they further introduce Route-Fused MKA (FastMKA), which fuses the multi-level memory sources prior to attention computation to improve efficiency. Experimental results show that the proposed approach achieves training throughput improvements of up to 5× and reduces evaluation latency by 1.8× compared to baseline methods, while maintaining perplexity on par with Multi-Latent Attention (MLA). This represents a step toward scalable and efficient long-context language modeling.

📝 Abstract
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable perplexity to MLA while achieving up to 5x faster training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
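The two mechanisms the abstract describes — per-level attention mixed by a learned router (MKA), and fusing the memory levels before a single attention pass (FastMKA) — can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `router` module, and the query-pooled broadcast weighting in `fastmka_sketch` are hypothetical choices for exposition.

```python
import torch
import torch.nn.functional as F


def mka_sketch(q, kv_levels, router):
    """Hierarchical attention sketch: attend separately over each KV
    cache level (e.g. local, session, long-term), then mix the
    per-level outputs with routing weights computed from the queries.

    q:         (batch, heads, q_len, d) queries
    kv_levels: list of (k, v) pairs, one per memory level,
               each shaped (batch, heads, mem_len_i, d)
    router:    any module mapping q -> per-level logits of shape
               (batch, heads, q_len, num_levels)  [assumed interface]
    """
    per_level = [F.scaled_dot_product_attention(q, k, v) for k, v in kv_levels]
    stacked = torch.stack(per_level, dim=-2)          # (b, h, q_len, L, d)
    weights = router(q).softmax(dim=-1).unsqueeze(-1)  # (b, h, q_len, L, 1)
    return (weights * stacked).sum(dim=-2)            # (b, h, q_len, d)


def fastmka_sketch(q, kv_levels, router):
    """Route-fused variant sketch: compute one broadcast routing weight
    per level from the pooled queries, scale each level's keys by it,
    and run a single attention pass over the concatenated caches."""
    w = router(q.mean(dim=2, keepdim=True)).softmax(dim=-1)  # (b, h, 1, L)
    ks = torch.cat(
        [w[..., i : i + 1] * k for i, (k, _) in enumerate(kv_levels)], dim=2
    )
    vs = torch.cat([v for _, v in kv_levels], dim=2)
    return F.scaled_dot_product_attention(q, ks, vs)
```

The intuition for FastMKA's speedup is visible in the sketch: `mka_sketch` pays for one attention call per memory level, while `fastmka_sketch` collapses the levels into a single fused cache and pays for one call total, at the cost of coarser (query-pooled rather than per-query) routing.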
Problem

Research questions and friction points this paper is trying to address.

long-context reasoning
KV cache
attention mechanism
memory efficiency
computational bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-Keyed Attention
hierarchical attention
KV cache optimization
long-context reasoning
efficient attention