LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

📅 2025-09-11
🤖 AI Summary
To address the excessive KV cache memory overhead of long-context LLM inference and the limitations of existing compression methods — namely, their reliance on hand-crafted heuristics and their lack of dynamic budget allocation — this paper proposes LAVa, a training-free unified framework for KV cache eviction and dynamic budget allocation. The method models information flow through Transformer residual streams to derive an attention output loss metric, enabling information-fidelity comparison of cache entries across layers and heads. On this basis it introduces a dual-level (layer-wise and head-wise) dynamic budget allocation scheme coupled with a unified eviction mechanism. Experiments reveal a task-dependent preference: generation tasks benefit most from dynamic layer-level budgets, whereas extraction tasks depend critically on dynamic head-level budgets. LAVa achieves significant improvements over state-of-the-art cache compression methods on LongBench, Needle-In-A-Haystack, RULER, and InfiniteBench.

📝 Abstract
KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on this framework, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, RULER, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache memory usage in long-context LLM inference
Enabling dynamic budget allocation across layers and heads
Minimizing information loss during cache compression without training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimizes information loss in Transformer residual streams
Uses layer-wise compression with dynamic head budgets
Achieves dynamic layer budgets by contrasting cross-layer information
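The "dynamic head budgets" idea above can be pictured with a minimal sketch (all names are hypothetical, and the per-entry importance scores stand in for the paper's attention-output-loss metric): instead of keeping a fixed top-k per head, all cache entries of a layer compete under one shared layer budget, so each head's share of the budget emerges from a global, cross-head comparison.

```python
import numpy as np

def evict_with_dynamic_head_budgets(scores: np.ndarray, layer_budget: int) -> np.ndarray:
    """Toy sketch of cross-head eviction under a shared layer budget.

    scores: (num_heads, seq_len) importance of each cached entry, assumed
            comparable across heads (as LAVa's derived metric allows).
    layer_budget: total number of entries this layer may keep.
    Returns a boolean keep-mask of shape (num_heads, seq_len).
    """
    num_heads, seq_len = scores.shape
    flat = scores.ravel()
    # Globally select the top-`layer_budget` entries across all heads;
    # heads with more important entries automatically keep more of them.
    keep_idx = np.argpartition(flat, -layer_budget)[-layer_budget:]
    mask = np.zeros(num_heads * seq_len, dtype=bool)
    mask[keep_idx] = True
    return mask.reshape(num_heads, seq_len)
```

Because selection is global rather than per-head, a head whose entries all score low may retain very few slots while an information-dense head keeps many — the head budgets are an output of the comparison, not an input.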
Yiqun Shen
State Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
Song Yuan
Zhejiang University, CAGE
Development Economics, International Economics, Political Economy, Economic History
Zhengze Zhang
State Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
Xiaoliang Wang
Associate Professor of Computer Science, Nanjing University
Networking System
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning, Foundation Models
Nguyen Cam-Tu
State Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University