🤖 AI Summary
This work addresses the challenges of distribution shift, high latency, and performance degradation that existing long-context large language models face during inference, particularly due to inefficient KV cache retrieval. The authors propose a GPU-native, efficient KV cache retrieval framework that, for the first time, enables on-demand top-k retrieval robust to distribution shifts. Their approach combines collision-based candidate selection with quantized inner-product reranking and leverages Unified Virtual Addressing (UVA) to support CPU offloading of the KV cache. Evaluated on contexts of up to one million tokens, the method reduces decoding latency by 17× and 44× compared to MagicPIG and PQCache, respectively, achieves 2.8× the throughput of full attention, and even surpasses full attention in speed at batch size 1, significantly enhancing inference efficiency and scalability.
📝 Abstract
KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework built on collision-based candidate selection followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full-attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full-attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache top-$k$ retrieval baselines.
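To make the two-stage retrieval idea concrete, the sketch below illustrates one plausible instantiation: collision-based candidate selection via random-hyperplane (SimHash-style) signatures, followed by reranking the candidates with int8-quantized inner products. All names, dimensions, bit widths, and quantization choices here are illustrative assumptions for a single query head, not ParisKV's actual GPU implementation.

```python
import numpy as np

# Illustrative two-stage KV retrieval (assumptions, not ParisKV's real kernels):
# Stage 1: collision counting against random-hyperplane signatures selects candidates.
# Stage 2: quantized (int8) inner products rerank candidates to pick the final top-k.

rng = np.random.default_rng(0)
d, n_keys, n_bits, n_candidates, top_k = 64, 1024, 128, 64, 8

keys = rng.standard_normal((n_keys, d)).astype(np.float32)   # cached keys
query = rng.standard_normal(d).astype(np.float32)            # current query

# Stage 1: SimHash-style signatures; a key that agrees with the query on many
# hyperplane signs ("collides" often) is likely to have a large inner product.
planes = rng.standard_normal((n_bits, d)).astype(np.float32)
key_sigs = (keys @ planes.T) > 0          # (n_keys, n_bits) sign bits
query_sig = (planes @ query) > 0          # (n_bits,) sign bits
collisions = (key_sigs == query_sig).sum(axis=1)
candidates = np.argsort(-collisions)[:n_candidates]

# Stage 2: rerank only the candidates with symmetric int8 quantized dot products
# (one scale per tensor here for simplicity; real systems often scale per channel).
k_scale = np.abs(keys).max() / 127.0
q_scale = np.abs(query).max() / 127.0
q_keys = np.clip(np.round(keys / k_scale), -127, 127).astype(np.int8)
q_query = np.clip(np.round(query / q_scale), -127, 127).astype(np.int8)
approx_scores = (
    q_keys[candidates].astype(np.int32) @ q_query.astype(np.int32)
) * (k_scale * q_scale)
top = candidates[np.argsort(-approx_scores)[:top_k]]
print(sorted(top.tolist()))  # indices of the retrieved top-k keys
```

In a CPU-offloaded setting, only the rows of the value cache at these `top` indices would need to be fetched to the GPU (e.g. via UVA zero-copy reads), which is what keeps the per-step transfer cost small relative to moving the full cache.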