KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the KV cache memory explosion problem in long-context LLM inference on resource-constrained devices, this paper proposes KeyDiff, a training-free, lightweight cache eviction method based on cosine similarity among key vectors. The method identifies and removes redundant keys by measuring pairwise angular similarity, thereby preserving semantically diverse keys. Its core contribution is the empirical observation, with theoretical justification, that highly distinctive key vectors correlate with higher attention scores; further, the proposed greedy diversity-based pruning strategy is shown to solve a KV cache selection problem that maximizes key diversity. Crucially, the approach requires no access to real-time attention scores, remains compatible with efficient attention kernels (e.g., FlashAttention), and integrates into standard Transformer decoders without architectural modification. Evaluated on LongBench with Llama 3.1-8B and Llama 3.2-3B, it stays within 0.04% of the non-evicting baseline at an 8K-token cache budget (roughly a 23% KV cache reduction).
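The eviction rule described above can be sketched in a few lines of NumPy. This is a hypothetical illustration of the general idea (score each cached key by its average cosine similarity to the other keys, then keep the most distinctive ones), not the authors' reference implementation; the function name and interface are assumptions.

```python
import numpy as np

def keydiff_evict(keys: np.ndarray, budget: int) -> np.ndarray:
    """Greedy key-diversity eviction sketch (hypothetical helper).

    keys:   (n, d) array of cached key vectors for one attention head.
    budget: number of keys to retain.
    Returns the sorted indices of the keys to keep.
    """
    # Normalize keys to unit length so dot products equal cosine similarities.
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = normed @ normed.T  # (n, n) pairwise cosine similarity

    # Redundancy score: mean similarity to all *other* keys
    # (subtract the self-similarity of 1.0 on the diagonal).
    redundancy = (sim.sum(axis=1) - 1.0) / (len(keys) - 1)

    # Keep the `budget` least redundant (most distinctive) keys.
    keep = np.argsort(redundancy)[:budget]
    return np.sort(keep)
```

Note that this scoring needs only the key vectors themselves, never the attention matrix, which is why the method stays compatible with fused kernels like FlashAttention that never materialize attention scores.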

📝 Abstract
In this work, we demonstrate that distinctive keys during LLM inference tend to have high attention scores. We explore this phenomenon and propose KeyDiff, a training-free KV cache eviction method based on key similarity. This method facilitates the deployment of LLM-based applications requiring long input prompts in resource-constrained environments with limited memory and compute budgets. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We demonstrate that KeyDiff computes the optimal solution to a KV cache selection problem that maximizes key diversity, providing a theoretical understanding of KeyDiff. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. We demonstrate the effectiveness of KeyDiff across diverse tasks and models, illustrating a performance gap of less than 0.04% with an 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on the LongBench benchmark for Llama 3.1-8B and Llama 3.2-3B.
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache eviction for long-context LLM inference
Enabling efficient LLM deployment in resource-constrained environments
Maximizing key diversity without relying on attention scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free KV cache eviction method
Maximizes key diversity in cache
Works with optimized attention mechanisms
Junyoung Park
Qualcomm AI Research, San Diego, CA, USA
Dalton Jones
Matthew Morse
Raghavv Goel
Qualcomm AI Research
efficient LLMs · deep learning · reinforcement learning · control theory
Mingu Lee
Qualcomm AI Research
AI · ML · LLM · Signal processing
Christopher Lott