🤖 AI Summary
To address memory explosion and increased attention latency caused by linear KV cache growth with context length in large language model (LLM) inference, this paper proposes a query-agnostic KV cache compression method. The core innovation lies in quantifying the information contribution of each KV pair to the original context via the LLM’s autoregressive reconstruction capability—enabling importance estimation without requiring the current query. This facilitates cross-query reuse of compressed caches, significantly improving efficiency and stability in multi-query scenarios. The method integrates importance-aware KV quantization and selective eviction, and is fully compatible with FlashAttention. Evaluated on LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, it achieves 3–4× KV cache compression and ~2× decoding speedup, with negligible performance degradation across question answering, retrieval, reasoning, and code understanding tasks at context lengths up to 170K tokens.
📝 Abstract
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3–4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
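The eviction idea described above can be illustrated with a minimal sketch: score each cached KV position by the attention it receives while the model reconstructs the original context, then keep only the top-scoring fraction. This is a simplified stand-in, not the paper's exact algorithm; the function names, the max-over-steps scoring rule, and the toy random attention matrix are all illustrative assumptions.

```python
import numpy as np

def kv_importance_scores(attn_weights):
    """attn_weights: (num_reconstruction_steps, context_len) array of the
    attention each cached KV position receives while the LLM re-generates
    its own context. A KV pair's importance is taken as the maximum
    attention it receives across reconstruction steps (a simplification
    of query-agnostic, reconstruction-based scoring)."""
    return attn_weights.max(axis=0)

def evict(kv_cache, scores, budget_ratio=0.3):
    """Keep only the top `budget_ratio` fraction of KV pairs by score,
    preserving their original positional order."""
    n_keep = max(1, int(len(scores) * budget_ratio))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return [kv_cache[i] for i in keep_idx], keep_idx

# Toy example: 8 cached KV pairs, 4 reconstruction steps of random attention.
rng = np.random.default_rng(0)
attn = rng.random((4, 8))
scores = kv_importance_scores(attn)
compressed, kept = evict(list(range(8)), scores, budget_ratio=0.25)
print(kept)  # positional indices of the retained KV pairs
```

Because the scores depend only on the context (not on any downstream query), the same compressed cache can be reused across different queries, which is the property that distinguishes this approach from query-aware eviction.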