🤖 AI Summary
Problem: KV cache memory overhead in large language model (LLM) inference is substantial. Caches can be reused across conversation turns via shared-prefix prompts, but stale caches consume scarce GPU memory, require offloading, or force costly recomputation.
Method: This paper proposes an efficient KV cache compression and encoding framework tailored for conversational scenarios. Inspired by media compression, it jointly applies PCA-based feature decorrelation, adaptive quantization, and entropy coding—requiring no model parameter modification. A lightweight calibration step enables end-to-end transform coding.
Contribution/Results: Evaluated on Llama 3, Mistral NeMo, and R1-Qwen 2.5, the method achieves up to 20× compression, and 40× or higher in specific use cases such as shared-prefix contexts. It maintains reasoning and long-context accuracy while consistently outperforming inference-time KV cache baselines such as token eviction, quantization, and SVD-based methods at higher compression ratios.
📝 Abstract
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.