MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

📅 2026-03-31
🤖 AI Summary
This work addresses the severe I/O bottleneck that long-context decoding imposes on large language models: each generated token re-reads an ever-growing KV cache. Existing acceleration methods often compromise attention accuracy or restrict which cache entries remain accessible. To overcome this, the authors propose MAC-Attention, a mechanism that reuses previously computed attention results for semantically similar queries through three stages: match, amend, and complete. On a match hit, per-token compute and bandwidth are constant regardless of context length, while full attention fidelity and cache accessibility are preserved. The method integrates pre-RoPE L2 matching, localized boundary recomputation, and numerically stable fusion, and is compatible with I/O-aware kernels, paged KV caches, and MQA/GQA architectures. Experiments demonstrate up to a 99% reduction in KV cache accesses, over 60% lower token generation latency at 128K context length, a 14.3× speedup in attention computation, and a 2.6× end-to-end throughput improvement.
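The match stage described above can be pictured as a nearest-neighbor lookup over a short window of recent pre-RoPE queries. The sketch below is illustrative only: the window size, distance threshold, and function names are assumptions, not the paper's actual kernel.

```python
import numpy as np

def match_recent_query(q, recent_queries, threshold):
    """Hedged sketch of pre-RoPE L2 matching over a short local window.

    Returns the index of the closest recent query if its L2 distance to
    `q` is within `threshold` (a match hit), else None (a miss). The
    threshold and buffer management are hypothetical choices here.
    """
    if len(recent_queries) == 0:
        return None
    dists = np.linalg.norm(np.asarray(recent_queries) - q, axis=1)
    i = int(np.argmin(dists))
    return i if dists[i] <= threshold else None
```

On a hit, the decoder would reuse the attention result cached for the matched query (then amend and complete it); on a miss, it falls back to ordinary full attention.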
📝 Abstract
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git
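The "numerically stable merge" in the complete stage can be understood via the standard online-softmax trick used by FlashAttention-style kernels: each KV segment yields a partial output plus a log-sum-exp of its scores, and two partials combine exactly into the full-attention result. A minimal single-head NumPy sketch (function names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def attention_partial(q, K, V):
    """Partial attention over one KV segment.

    Returns (o, lse): the softmax(q K^T / sqrt(d)) V output restricted to
    this segment, and the log-sum-exp of its scores, needed for merging.
    """
    s = K @ q / np.sqrt(q.shape[-1])   # scaled scores for this segment
    m = s.max()
    p = np.exp(s - m)                  # shift by max for stability
    denom = p.sum()
    o = (p / denom) @ V                # segment-normalized output
    lse = m + np.log(denom)            # log of the segment's softmax mass
    return o, lse

def merge(o1, lse1, o2, lse2):
    """Numerically stable fusion of two attention partials.

    Reweights each partial by its softmax mass so the result equals
    attention computed over the union of both segments.
    """
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2), m + np.log(w1 + w2)
```

Because the merge is exact, fusing a reused-and-amended partial with fresh attention over the KV tail preserves full-attention quality, which is how a scheme like this can avoid the fidelity loss of compression or eviction.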
Problem

Research questions and friction points this paper is trying to address.

long-context decoding, IO-bound, KV cache, attention computation, LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

MAC-Attention, long-context decoding, attention reuse, KV cache optimization, compute reuse
Authors

Jinghan Yao (The Ohio State University)
Sam Adé Jacobs (Microsoft, WA, USA)
Walid Krichene (Microsoft)
Masahiro Tanaka (Anyscale, CA, USA)
Dhabaleswar K Panda (The Ohio State University)