🤖 AI Summary
This work addresses the substantial memory and computational burden imposed by key-value (KV) caching in Transformers, which has become a bottleneck for both training and autoregressive decoding. The authors propose Low-Rank Key-Value (LRKV) attention, which shares a full-rank KV projection across attention heads while adding head-specific low-rank residual components. This design substantially compresses the KV cache without sacrificing token-level resolution or inter-head diversity. LRKV establishes a continuous trade-off between fully shared and fully independent attention, subsuming KV-sharing strategies such as MQA and GQA within a unified framework, and differs fundamentally from latent-variable compression methods such as MLA. Pretrained at scales from 128M to 6.3B parameters, LRKV uses only 45–53% of MHA's KV cache, reaches equivalent baseline quality 18–25% faster in training steps, and achieves lower test loss and improved downstream performance.
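The core construction — one shared full-rank KV projection per layer plus a rank-r head-specific residual — can be sketched in a few lines of numpy. This is a minimal illustration under assumed toy dimensions, not the paper's implementation; the variable names, shapes, and the cache-accounting at the end are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper)
d_model, d_head, n_heads, rank, seq_len = 64, 16, 4, 4, 8

X = rng.standard_normal((seq_len, d_model))

# Shared full-rank key projection, common to all heads in the layer
W_k_shared = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

# Head-specific low-rank residual factors, rank << d_head
A = rng.standard_normal((n_heads, d_model, rank)) / np.sqrt(d_model)
B = rng.standard_normal((n_heads, rank, d_head)) / np.sqrt(rank)

# Per-head keys: shared component + low-rank head-specific residual
# K[h] = X @ W_k_shared + (X @ A[h]) @ B[h]
K = np.stack([X @ W_k_shared + (X @ A[h]) @ B[h] for h in range(n_heads)])

# Sketch of why the cache shrinks (an assumption about the caching
# scheme, not a claim from the abstract): per token, one can cache the
# shared projection (d_head floats) plus the rank-r residual activations
# per head, instead of n_heads full heads as in MHA.
cache_per_token_mha = n_heads * d_head          # 64 floats
cache_per_token_lrkv = d_head + n_heads * rank  # 32 floats, ~50% here
print(K.shape, cache_per_token_mha, cache_per_token_lrkv)
```

With these toy numbers the per-token cache halves, which is in the same ballpark as the 45–53% figure reported in the abstract; the actual ratio depends on the chosen rank and head count. Values would be handled symmetrically to keys.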
📝 Abstract
The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads while remaining compute-efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models from 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45–53% of MHA's KV cache. LRKV reaches equivalent baseline quality 18–25% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance on the ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.