🤖 AI Summary
This work addresses the high memory overhead of the KV cache in large language model inference, where existing compression methods struggle to balance performance and computational cost. The authors propose a gating-based dynamic KV cache eviction approach that adaptively retains critical key-value pairs during both the prefill and decoding stages. By introducing lightweight sink-attention gating modules and training them with a task-agnostic reconstruction objective in a forward-only manner (no backpropagation), the method achieves high generality with minimal computational overhead. Experiments on models such as Qwen2.5-1M, Qwen3, and Gemma3 demonstrate that up to 70% of the KV cache can be evicted with negligible performance degradation, showing strong applicability across long-context processing, code understanding, and mathematical reasoning tasks.
📝 Abstract
Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often trade performance degradation against computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios at negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies only on forward passes of the LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
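To make the eviction mechanism concrete, the sketch below shows a minimal, hypothetical version of gating-based KV eviction: a lightweight gate assigns an importance score to each cached KV pair, and only the top-scoring fraction is retained (a `keep_ratio` of 0.3 corresponds to evicting 70% of the cache). The linear-plus-sigmoid gate here is an illustrative stand-in; the paper's actual sink-attention gating module and its reconstruction-based training are not specified in this summary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evict_kv_cache(keys, values, gate_w, keep_ratio=0.3):
    """Score each cached KV pair with a lightweight gate and keep the top fraction.

    keys, values : (seq_len, head_dim) cached tensors for one attention head
    gate_w       : (head_dim,) gate weights -- a simple stand-in for the
                   paper's sink-attention gating module (assumption)
    keep_ratio   : fraction of KV pairs retained (0.3 evicts 70%)
    """
    scores = sigmoid(keys @ gate_w)                    # importance per position
    n_keep = max(1, int(np.ceil(keep_ratio * len(scores))))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])   # preserve sequence order
    return keys[keep_idx], values[keep_idx], keep_idx

# Toy usage: 10 cached positions, head dimension 4.
rng = np.random.default_rng(0)
K = rng.standard_normal((10, 4))
V = rng.standard_normal((10, 4))
w = rng.standard_normal(4)
K_kept, V_kept, idx = evict_kv_cache(K, V, w, keep_ratio=0.3)
print(K_kept.shape)  # 3 of 10 positions survive: (3, 4)
```

Because the gate is a small module on top of a frozen model, scoring adds only a cheap elementwise pass over the cache; the retained indices can then be used to compact the KV tensors in place at each eviction step.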