🤖 AI Summary
In Transformer decoder-based large language model inference, KV caching incurs severe memory and memory-bandwidth bottlenecks as sequence length grows. The paper proposes KV-Latent, a paradigm that performs dimension-wise compression of Key-Value vectors into a latent space. To preserve positional fidelity in the compressed space, it introduces a frequency-aware variant of rotary position embedding (RoPE), ensuring stable positional encoding under dimensionality reduction. The framework enables component-wise analysis of compressing Keys and Values independently and natively supports architectures such as grouped-query attention. Deployment requires only minimal fine-tuning. Experiments across diverse models, including LLaMA and Phi-3, demonstrate up to 72% reduction in KV cache memory footprint and significant savings in key-value memory bandwidth, yielding up to 1.8× inference speedup while incurring less than 0.3% perplexity degradation.
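The frequency-aware RoPE idea can be illustrated with a small sketch. The exact sampling rule is not given in this summary, so the code below shows one plausible reading, under stated assumptions: a lower-dimensional vector keeps only the slower-rotating (lower-frequency) entries of the full model's RoPE schedule, dropping the highest frequencies that would behave noisily at reduced dimensionality while retaining the long-wavelength components responsible for position attenuation. The function names (`rope_freqs`, `rope_freqs_latent`, `apply_rope`) are illustrative, not the paper's API.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    # Standard RoPE schedule: theta_i = base^(-2i/dim) for i in [0, dim/2).
    # i = 0 gives the highest frequency (1.0); frequencies decay with i.
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

def rope_freqs_latent(latent_dim, full_dim, base=10000.0):
    # Hypothetical frequency-aware variant: reuse the full-dimensional
    # schedule but keep only the latent_dim//2 LOWEST frequencies,
    # skipping the fast-rotating components that add noise when the
    # vector dimension is reduced.
    full = rope_freqs(full_dim, base)
    return np.sort(full)[: latent_dim // 2]

def apply_rope(x, pos, freqs):
    # Rotate coordinate pairs of x by angle pos * freq (rotate-half form).
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because the rotation is orthogonal, `apply_rope` preserves vector norms, so the latent keys keep their magnitudes while encoding position; only which frequencies appear in the schedule changes.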
📝 Abstract
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually growing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data-transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed, with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding the noise introduced by higher frequencies while retaining position attenuation. Our experiments, on models both with and without Grouped Query Attention, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing the Key and Value components on the model's performance. Our approach allows for the construction of more efficient language model systems, and opens new possibilities for KV Cache saving and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
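A minimal sketch of the core cache-saving idea, under assumptions: per-head Keys and Values are mapped by learned down-projections (trained in the light fine-tuning stage; random matrices stand in here) into a latent space, and only the latent vectors are cached. The names `W_k_down`/`W_v_down` and the sizes are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_head, d_latent, seq_len = 128, 32, 16  # illustrative sizes only

# Hypothetical learned down-projections into the latent space
# (random stand-ins for weights obtained via light fine-tuning).
W_k_down = rng.normal(size=(d_head, d_latent)) / np.sqrt(d_head)
W_v_down = rng.normal(size=(d_head, d_latent)) / np.sqrt(d_head)

# Full-dimensional K/V for one head over a short sequence.
k = rng.normal(size=(seq_len, d_head))
v = rng.normal(size=(seq_len, d_head))

# Cache only the latent K/V; the footprint shrinks by d_latent / d_head.
k_latent = k @ W_k_down
v_latent = v @ W_v_down

full_bytes = k.nbytes + v.nbytes
latent_bytes = k_latent.nbytes + v_latent.nbytes
ratio = latent_bytes / full_bytes
print(ratio)  # 0.25 with these sizes (32 / 128)
```

With 32 latent dimensions against 128 full dimensions, the cached bytes drop to a quarter, and at inference time attention scores can be computed directly against the latent Keys once queries are projected into the same space, which is what also saves memory bandwidth.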