🤖 AI Summary
In long-context inference, KV cache size grows linearly with sequence length, causing substantial GPU memory overhead and low retrieval efficiency; existing compression methods struggle to balance accuracy and speed. Method: We propose an algorithm-system co-optimization framework: (1) decoupling speculative KV retrieval from the critical path and introducing fine-grained correction, and (2) designing a CPU-GPU hybrid memory layout with a dual-buffer streaming recall mechanism. Contribution/Results: Evaluated across multiple models (e.g., Llama, Qwen) and tasks (long-document QA, code generation), our approach achieves near-lossless accuracy (ΔPPL < 0.1) while improving KV retrieval throughput by up to 13× over state-of-the-art methods—significantly alleviating the efficiency bottleneck in long-context inference.
📝 Abstract
Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache, whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13× speedup compared to SOTA KV retrieval methods.
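The double-buffered streamed recall described above can be illustrated with a minimal sketch: while one buffer's worth of recalled KV entries is being consumed by compute, the next chunk is staged in the background, hiding transfer latency behind computation. This is an illustrative ping-pong pipeline using Python threads, not the paper's CUDA-stream implementation; `transfer` and `compute` are hypothetical stand-ins for the CPU-to-GPU copy and the attention step over recalled KV.

```python
import threading
from queue import Queue

def double_buffered_recall(chunks, transfer, compute):
    """Overlap transfer of chunk i+1 with compute on chunk i.

    A bounded two-slot queue acts as the ping-pong buffer pair:
    the producer thread stages (transfers) the next chunk while
    the consumer computes over the previously staged one.
    """
    buffers = Queue(maxsize=2)  # two slots = double buffering

    def producer():
        for c in chunks:
            buffers.put(transfer(c))  # stage next chunk in the background
        buffers.put(None)             # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    results = []
    while (buf := buffers.get()) is not None:
        results.append(compute(buf))  # consume while producer stages ahead
    t.join()
    return results
```

With dummy stand-ins, e.g. `double_buffered_recall(range(4), transfer=lambda c: c * 10, compute=lambda b: b + 1)`, the pipeline yields the same results as a sequential loop while allowing the two stages to overlap in time.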