Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high memory overhead of the KV cache in large language model inference, where existing compression methods struggle to balance performance and computational cost. The authors propose a gating-based dynamic KV cache eviction approach that adaptively retains critical key-value pairs during both the prefill and decoding stages. By introducing lightweight sink-attention gating modules and employing a task-agnostic reconstruction objective trained in a forward-only manner, without backpropagation, the method achieves high generality and minimal computational overhead. Experiments on models such as Qwen2.5-1M, Qwen3, and Gemma3 demonstrate that up to 70% of the KV cache can be pruned with negligible performance degradation, showing strong applicability across long-context processing, code understanding, and mathematical reasoning tasks.

📝 Abstract
Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
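The core idea of the abstract can be sketched as score-and-keep eviction: a lightweight gate assigns an importance score to each cached KV pair, and only the top-scoring fraction is retained. The sketch below is a minimal illustration, not the paper's implementation; the `sink_gate_scores` function (a single learned "sink" query attending over cached keys) and all names and shapes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sink_gate_scores(keys, sink_query):
    # Hypothetical sink-attention gate: one learned query attends over
    # all cached keys; the softmax weights act as importance scores.
    logits = keys @ sink_query / np.sqrt(keys.shape[-1])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def evict_kv(keys, values, scores, evict_ratio=0.7):
    # Retain the top-(1 - evict_ratio) fraction of KV pairs by score,
    # preserving their original positional order.
    n_keep = max(1, int(round(keys.shape[0] * (1 - evict_ratio))))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return keys[keep], values[keep]

# Toy cache: 100 cached tokens, head dimension 16.
K = rng.standard_normal((100, 16))
V = rng.standard_normal((100, 16))
q_sink = rng.standard_normal(16)

scores = sink_gate_scores(K, q_sink)
K_kept, V_kept = evict_kv(K, V, scores, evict_ratio=0.7)
print(K_kept.shape)  # (30, 16): 70% of the cache evicted
```

In a real system the gate would run per attention head at both prefill and decode time, and (per the abstract) would be trained with forward passes only against a reconstruction objective rather than by backpropagation through the frozen LLM.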
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
LLM inference
efficient inference
cache management
performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache eviction
gating mechanism
frozen-weight LLMs
efficient inference
task-agnostic training