KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the KV cache memory explosion problem in long-context LLM inference on resource-constrained devices, this paper proposes KeyDiff, a training-free, lightweight cache eviction method based on cosine similarity among key vectors. The method identifies and removes redundant keys by measuring pairwise angular similarity, thereby preserving semantically diverse keys. Its core contribution is the empirical observation, with theoretical justification, that highly distinctive key vectors correlate with higher attention scores; further, the proposed greedy diversity-based pruning strategy is shown to solve a KV cache selection problem that maximizes key diversity. Crucially, the approach requires no access to real-time attention scores, remains compatible with efficient attention kernels (e.g., FlashAttention), and integrates into standard Transformer decoders without architectural modification. Evaluated on LongBench with Llama 3.1-8B and Llama 3.2-3B, it stays within 0.04% of the non-evicting baseline at an 8K-token cache budget (roughly a 23% KV cache reduction).
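The eviction rule described above can be sketched in a few lines of NumPy. This is a hypothetical illustration of the general idea (score each cached key by its average cosine similarity to the other keys, then keep the most distinctive ones), not the authors' reference implementation; the function name and interface are assumptions.

```python
import numpy as np

def keydiff_evict(keys: np.ndarray, budget: int) -> np.ndarray:
    """Greedy key-diversity eviction sketch (hypothetical helper).

    keys:   (n, d) array of cached key vectors for one attention head.
    budget: number of keys to retain.
    Returns the sorted indices of the keys to keep.
    """
    # Normalize keys to unit length so dot products equal cosine similarities.
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = normed @ normed.T  # (n, n) pairwise cosine similarity

    # Redundancy score: mean similarity to all *other* keys
    # (subtract the self-similarity of 1.0 on the diagonal).
    redundancy = (sim.sum(axis=1) - 1.0) / (len(keys) - 1)

    # Keep the `budget` least redundant (most distinctive) keys.
    keep = np.argsort(redundancy)[:budget]
    return np.sort(keep)
```

Note that this scoring needs only the key vectors themselves, never the attention matrix, which is why the method stays compatible with fused kernels like FlashAttention that never materialize attention scores.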

📝 Abstract
In this work, we demonstrate that distinctive keys during LLM inference tend to have high attention scores. We explore this phenomenon and propose KeyDiff, a training-free KV cache eviction method based on key similarity. This method facilitates the deployment of LLM-based applications requiring long input prompts in resource-constrained environments with limited memory and compute budgets. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We demonstrate that KeyDiff computes the optimal solution to a KV cache selection problem that maximizes key diversity, providing a theoretical understanding of KeyDiff. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. We demonstrate the effectiveness of KeyDiff across diverse tasks and models, illustrating a performance gap of less than 0.04% with an 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on the LongBench benchmark for Llama 3.1-8B and Llama 3.2-3B.
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache eviction for long-context LLM inference
Enabling efficient LLM deployment in resource-constrained environments
Maximizing key diversity without relying on attention scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free KV cache eviction method
Maximizes key diversity in cache
Works with optimized attention mechanisms
Junyoung Park
Qualcomm AI Research, San Diego, CA, USA
Dalton Jones
Matthew Morse
Raghavv Goel
Qualcomm AI Research
efficient LLMs · deep learning · reinforcement learning · control theory
Mingu Lee
Qualcomm AI Research
AI · ML · LLM · Signal processing
Christopher Lott