KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

📅 2025-07-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In Transformer decoder-based large language model inference, KV caching incurs severe memory and memory-bandwidth bottlenecks as sequence length increases. This paper proposes KV-Latent, a novel paradigm that performs dimension-wise latent-space compression of KV vectors. To preserve positional fidelity in the compressed space, we introduce frequency-aware rotary position embedding (RoPE), ensuring stable positional encoding under dimensionality reduction. The framework enables component-wise independent compression analysis and natively supports architectures such as grouped-query attention. Deployment requires only minimal fine-tuning. Experiments across diverse models—including LLaMA and Phi-3—demonstrate up to 72% reduction in KV cache memory footprint and significant savings in key-value memory bandwidth, yielding up to 1.8× inference speedup, while incurring less than 0.3% perplexity degradation.
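The dimension-wise compression described above can be sketched in miniature. This is a hedged illustration, not the paper's actual method: the projection matrices `W_down` and `W_up` are random stand-ins for learned parameters, and the dimensions are arbitrary. The point is only to show where the cache saving comes from: key/value vectors are stored in a lower-dimensional latent space and mapped back when needed.

```python
import random

def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rand_matrix(rows, cols, seed):
    """Random stand-in for a learned projection matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

d_head, d_latent = 128, 32  # illustrative sizes: a 4x smaller cache per head

# Hypothetical learned down-/up-projections: only the latent vector is
# written to the KV cache; the full-dimension vector is recovered (or
# attention is computed in latent space) at decode time.
W_down = rand_matrix(d_latent, d_head, seed=0)
W_up = rand_matrix(d_head, d_latent, seed=1)

k = [1.0] * d_head                # a full-dimension key vector
k_latent = matvec(W_down, k)      # what actually goes into the KV cache
k_restored = matvec(W_up, k_latent)

print(f"cached floats per key: {len(k_latent)} vs {len(k)}")
```

Caching 32 floats instead of 128 per key (and likewise per value) is what drives both the memory-footprint and bandwidth savings the summary reports.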

📝 Abstract
Large language models (LLMs) based on Transformer decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the decoder architecture, the Key-Value (KV) cache, which grows steadily during inference, has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data-transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we significantly reduce the KV cache footprint and improve inference speed, with only a small amount of extra training, less than 1% of the cost of pre-training. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency-sampling mechanism, avoiding the noise introduced by higher frequencies while retaining positional attenuation. Our experiments, on models both with and without Grouped Query Attention, yield satisfactory results. Finally, we conduct comparative experiments to study the impact of separately reducing the Key and Value components on model performance. Our approach allows for the construction of more efficient language model systems, and opens new possibilities for KV cache saving and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
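The abstract's modified frequency-sampling mechanism for RoPE can be illustrated with a small sketch. This is an assumption-laden reading, not the paper's exact formulation: `freq_aware_rope_freqs` simply keeps the slowest-rotating (lowest-frequency) portion of the full-dimension RoPE spectrum rather than recomputing frequencies for the reduced dimension, which would otherwise skew the spectrum toward the high frequencies the abstract identifies as noisy.

```python
import math

def rope_freqs(dim: int, base: float = 10000.0):
    """Standard RoPE inverse frequencies theta_i = base^(-2i/dim)
    for i = 0 .. dim/2 - 1 (largest frequency first)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def freq_aware_rope_freqs(latent_dim: int, full_dim: int, base: float = 10000.0):
    """Hypothetical frequency-aware sampling for a reduced head dimension:
    reuse the latent_dim/2 lowest frequencies of the full-dimension
    spectrum, dropping the fast, noisy rotations while keeping the slow
    components that carry long-range positional attenuation."""
    full = rope_freqs(full_dim, base)
    return full[-(latent_dim // 2):]

full = rope_freqs(128)
latent = freq_aware_rope_freqs(64, 128)
# only the slow, long-range rotations survive the reduction
assert latent == full[-32:]
```

Naively calling `rope_freqs(64)` instead would stretch the same frequency range over fewer pairs, changing every rotation rate at once; sub-sampling the existing spectrum keeps the surviving components identical to those the full-dimension model was trained with.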
Problem

Research questions and friction points this paper is trying to address.

Growing KV cache memory and bandwidth consumption during LLM inference
Instability of Rotary Positional Embedding applied to low-dimensional vectors
Achieving efficient LLMs without extensive extra training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Down-sampling KV vectors to latent space
Enhanced Rotary Positional Embedding stability
KV Cache footprint reduction with minimal extra training
Luohe Shi
Wuhan University
CSAINLP
Zuchao Li
Wuhan University
Natural Language Processing · Machine Learning
Lefei Zhang
School of Computer Science, Wuhan University
Pattern Recognition · Machine Learning · Image Processing · Remote Sensing
Guoming Liu
Xiaomi, Beijing, China
Baoyuan Qi
Xiaomi, Beijing, China
Hai Zhao
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China