🤖 AI Summary
Problem: KV cache memory overhead in large language model (LLM) inference is substantial. Caches can be reused across conversation turns via shared-prefix prompts, but stale caches consume scarce GPU memory, require offloading, or force costly recomputation.
Method: This paper proposes an efficient KV cache compression and encoding framework tailored for conversational scenarios. Inspired by media compression, it jointly applies PCA-based feature decorrelation, adaptive quantization, and entropy coding—requiring no model parameter modification. A lightweight calibration step enables end-to-end transform coding.
Contribution/Results: Evaluated on Llama 3, Mistral NeMo, and R1-Qwen 2.5, the method achieves up to 20× compression, and 40× or higher in specific use cases such as shared-prefix contexts. It maintains reasoning and long-context accuracy while consistently outperforming inference-time KV cache baselines such as token eviction, quantization, and SVD-based methods at higher compression ratios.
📝 Abstract
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.