🤖 AI Summary
This work addresses the substantial memory and computational burden imposed by key-value (KV) caching in Transformers, which has become a bottleneck for both training and autoregressive decoding. The authors propose Low-Rank Key-Value (LRKV) attention, which shares a full-rank KV projection across attention heads while adding head-specific low-rank residual components. This design substantially compresses the KV cache without sacrificing token-level resolution or inter-head diversity. LRKV establishes a continuous trade-off between fully shared and fully independent attention, subsuming KV-sharing strategies such as MQA and GQA within a unified framework, and differs fundamentally from latent-variable compression methods such as MLA. Pretrained at scales from 128M to 6.3B parameters, LRKV uses only 45–53% of MHA's KV cache, reaches equivalent baseline quality 18–25% faster in training steps, and achieves lower test loss and improved downstream performance.
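The core construction — one shared full-rank KV projection per layer plus a rank-r head-specific residual — can be sketched in a few lines of numpy. This is a minimal illustration under assumed toy dimensions, not the paper's implementation; the variable names, shapes, and the cache-accounting at the end are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper)
d_model, d_head, n_heads, rank, seq_len = 64, 16, 4, 4, 8

X = rng.standard_normal((seq_len, d_model))

# Shared full-rank key projection, common to all heads in the layer
W_k_shared = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

# Head-specific low-rank residual factors, rank << d_head
A = rng.standard_normal((n_heads, d_model, rank)) / np.sqrt(d_model)
B = rng.standard_normal((n_heads, rank, d_head)) / np.sqrt(rank)

# Per-head keys: shared component + low-rank head-specific residual
# K[h] = X @ W_k_shared + (X @ A[h]) @ B[h]
K = np.stack([X @ W_k_shared + (X @ A[h]) @ B[h] for h in range(n_heads)])

# Sketch of why the cache shrinks (an assumption about the caching
# scheme, not a claim from the abstract): per token, one can cache the
# shared projection (d_head floats) plus the rank-r residual activations
# per head, instead of n_heads full heads as in MHA.
cache_per_token_mha = n_heads * d_head          # 64 floats
cache_per_token_lrkv = d_head + n_heads * rank  # 32 floats, ~50% here
print(K.shape, cache_per_token_mha, cache_per_token_lrkv)
```

With these toy numbers the per-token cache halves, which is in the same ballpark as the 45–53% figure reported in the abstract; the actual ratio depends on the chosen rank and head count. Values would be handled symmetrically to keys.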
📝 Abstract
The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads while remaining compute-efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models from 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45–53% of MHA's KV cache. LRKV reaches equivalent baseline quality 18–25% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance on the ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.