🤖 AI Summary
To address memory explosion and increased attention latency caused by linear KV cache growth with context length in large language model (LLM) inference, this paper proposes a query-agnostic KV cache compression method. The core innovation lies in quantifying the information contribution of each KV pair to the original context via the LLM’s autoregressive reconstruction capability—enabling importance estimation without requiring the current query. This facilitates cross-query reuse of compressed caches, significantly improving efficiency and stability in multi-query scenarios. The method integrates importance-aware KV quantization and selective eviction, and is fully compatible with FlashAttention. Evaluated on LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, it achieves 3–4× KV cache compression and ~2× decoding speedup, with negligible performance degradation across question answering, retrieval, reasoning, and code understanding tasks at context lengths up to 170K tokens.
📝 Abstract
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3–4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
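The eviction idea described above can be illustrated with a minimal sketch: score each cached KV position by the attention it receives while the model reconstructs the original context, then keep only the top-scoring fraction. This is a simplified stand-in, not the paper's exact algorithm; the function names, the max-over-steps scoring rule, and the toy random attention matrix are all illustrative assumptions.

```python
import numpy as np

def kv_importance_scores(attn_weights):
    """attn_weights: (num_reconstruction_steps, context_len) array of the
    attention each cached KV position receives while the LLM re-generates
    its own context. A KV pair's importance is taken as the maximum
    attention it receives across reconstruction steps (a simplification
    of query-agnostic, reconstruction-based scoring)."""
    return attn_weights.max(axis=0)

def evict(kv_cache, scores, budget_ratio=0.3):
    """Keep only the top `budget_ratio` fraction of KV pairs by score,
    preserving their original positional order."""
    n_keep = max(1, int(len(scores) * budget_ratio))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return [kv_cache[i] for i in keep_idx], keep_idx

# Toy example: 8 cached KV pairs, 4 reconstruction steps of random attention.
rng = np.random.default_rng(0)
attn = rng.random((4, 8))
scores = kv_importance_scores(attn)
compressed, kept = evict(list(range(8)), scores, budget_ratio=0.25)
print(kept)  # positional indices of the retained KV pairs
```

Because the scores depend only on the context (not on any downstream query), the same compressed cache can be reused across different queries, which is the property that distinguishes this approach from query-aware eviction.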