Low-Rank Key-Value Attention

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial memory and computational burden imposed by key-value (KV) caching in Transformers, which has become a bottleneck for both pretraining and autoregressive decoding. The authors propose Low-Rank Key-Value (LRKV) attention, a method that shares a full-rank KV projection across attention heads while introducing head-specific low-rank residual components. This approach substantially compresses the KV cache without sacrificing token-level resolution or inter-head diversity. LRKV establishes a continuous trade-off between fully shared and fully independent attention, subsuming KV-sharing strategies such as MQA and GQA within a unified framework, and differs fundamentally from latent-variable compression methods such as MLA. Evaluated on a 2.5B-parameter model, LRKV reduces KV cache size by approximately 50%, decreases training FLOPs by 20–25%, and achieves faster convergence, lower validation perplexity, and improved downstream performance.

📝 Abstract
The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads while remaining compute-efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models of size 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45–53% of MHA's KV cache. LRKV reaches equivalent baseline quality 18–25% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance across ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.
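The abstract's mechanism can be sketched as follows: each layer holds one shared full-rank KV projection, and each head adds a low-rank residual on top of it. This is a minimal numpy illustration based only on the description above; the names (`W_k_shared`, `A`, `B`, rank `r`) and the per-token cache accounting are assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch of the LRKV key projection (assumed form, not
# the authors' code): per-head keys = shared full-rank projection
# plus a head-specific low-rank residual.
rng = np.random.default_rng(0)
d_model, d_head, n_heads, r = 64, 16, 4, 4  # rank r << d_head

# One full-rank key projection shared by all heads in the layer.
W_k_shared = rng.normal(size=(d_model, d_head)) * 0.02
# Head-specific low-rank residual factors A_h (d_model x r), B_h (r x d_head).
A = rng.normal(size=(n_heads, d_model, r)) * 0.02
B = rng.normal(size=(n_heads, r, d_head)) * 0.02

x = rng.normal(size=(8, d_model))  # 8 token embeddings

# Per-head keys: shared component + low-rank head-specific residual.
K = np.stack([x @ W_k_shared + x @ A[h] @ B[h] for h in range(n_heads)])

# Hypothetical per-token cache accounting (keys only): MHA stores one
# d_head vector per head; under this sketch LRKV could store the shared
# projection once plus an r-dim residual code (x @ A_h) per head.
mha_cache = n_heads * d_head        # 64 floats per token
lrkv_cache = d_head + n_heads * r   # 32 floats per token, ~50% of MHA
print(K.shape, mha_cache, lrkv_cache)
```

With these toy dimensions the cache roughly halves, consistent with the 45–53% figure reported in the abstract; the actual savings depend on the chosen rank and head count.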
Problem

Research questions and friction points this paper is trying to address.

KV cache
memory bottleneck
compute constraints
Transformer pretraining
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank KV Adaptation
KV Cache Compression
Multi-Head Attention
Transformer Scaling
Memory-Efficient Attention
James O'Neill
AI Group, Intercom, 124 St Stephen's Green, Dublin 2, D02 C628, Ireland
Robert Clancy
AI Group, Intercom, 124 St Stephen's Green, Dublin 2, D02 C628, Ireland
Mariia Matskevichus
AI Group, Intercom, 124 St Stephen's Green, Dublin 2, D02 C628, Ireland
Fergal Reid
University College Dublin, Ireland
Network Analysis · Machine Learning