FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context inference, KV cache size grows linearly with sequence length, causing substantial GPU memory overhead and low retrieval efficiency; existing compression methods struggle to balance accuracy and speed. Method: We propose an algorithm-system co-optimization framework: (1) decoupling speculative KV retrieval from the critical path and introducing fine-grained correction, and (2) designing a CPU-GPU hybrid memory layout with a dual-buffer streaming recall mechanism. Contribution/Results: Evaluated across multiple models (e.g., Llama, Qwen) and tasks (long-document QA, code generation), our approach achieves near-lossless accuracy (ΔPPL < 0.1) while improving KV retrieval throughput by up to 13× over state-of-the-art methods—significantly alleviating the efficiency bottleneck in long-context inference.

📝 Abstract
Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache, whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13× speedup compared to SOTA KV retrieval methods.
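The speculative-retrieval idea in the abstract can be sketched as follows. This is a hypothetical illustration, not FreeKV's actual implementation: the block-summary shapes, the dot-product top-k scoring, and the function names are all assumptions. The key point it shows is that a selection made from the *previous* step's query can run off the critical path, and a fine-grained correction afterwards fetches only the blocks the speculation missed.

```python
import numpy as np

def select_blocks(query, block_keys, k):
    """Score each KV block summary against the query and pick the top-k block IDs."""
    scores = block_keys @ query                      # (num_blocks,)
    return set(np.argsort(scores)[-k:].tolist())

def speculative_retrieve(prev_query, cur_query, block_keys, k):
    # 1) Speculate using the previous step's query (can overlap other work).
    speculative = select_blocks(prev_query, block_keys, k)
    # 2) Once the current query is available, compute the true selection.
    actual = select_blocks(cur_query, block_keys, k)
    # 3) Fine-grained correction: recall only the blocks speculation missed.
    missing = actual - speculative
    return speculative, actual, missing

rng = np.random.default_rng(0)
block_keys = rng.normal(size=(32, 64))               # 32 KV blocks, 64-dim summaries
prev_q = rng.normal(size=64)
cur_q = prev_q + 0.05 * rng.normal(size=64)          # consecutive queries are similar

spec, actual, missing = speculative_retrieve(prev_q, cur_q, block_keys, k=8)
print(f"speculative hit rate: {len(spec & actual) / len(actual):.2f}, "
      f"blocks to correct: {len(missing)}")
```

Because attention patterns change slowly between adjacent decoding steps, the correction set is typically small, which is what keeps the recall cost off the critical path in expectation.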
Problem

Research questions and friction points this paper is trying to address.

Efficient KV cache retrieval for large language models
Reducing KV cache size without accuracy loss
Optimizing KV retrieval speed and system performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative retrieval shifts KV selection off critical path
Hybrid KV layouts eliminate fragmented data transfers
Double-buffered streamed recall enhances retrieval efficiency
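The double-buffered streamed recall bullet can be sketched with a minimal producer-consumer pattern. This is a hypothetical sketch, not FreeKV's system code: `load` stands in for a CPU-to-GPU copy of one KV chunk and `process` for attention compute over it; the point is that while one buffer is being consumed, the copy of the next chunk is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def load(chunk_id):
    # Placeholder for a host-to-device copy of one KV chunk.
    return [chunk_id] * 4

def process(buf):
    # Placeholder for attention compute over a recalled chunk.
    return sum(buf)

def streamed_recall(chunk_ids):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        nxt = copier.submit(load, chunk_ids[0])      # prefetch the first chunk
        for i, _ in enumerate(chunk_ids):
            buf = nxt.result()                       # wait for the in-flight copy
            if i + 1 < len(chunk_ids):
                nxt = copier.submit(load, chunk_ids[i + 1])  # start next copy
            results.append(process(buf))             # compute overlaps the copy
    return results

print(streamed_recall([1, 2, 3]))  # → [4, 8, 12]
```

In a real system the two buffers would be pinned host memory plus a dedicated copy stream, so transfer and compute overlap on actual hardware rather than via threads.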