DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory bottleneck in long-context large language model inference, where KV cache memory grows linearly with sequence length. The authors propose DepthKV, a layer-dependent KV cache pruning framework motivated by the observation that layers differ significantly in their sensitivity to pruning. Instead of applying one pruning ratio to every layer, DepthKV estimates each layer's sensitivity from attention scores and distributes a fixed global cache budget across layers accordingly. At the same global pruning ratio, it consistently outperforms uniform pruning across multiple models and tasks, making more effective use of the same cache budget.
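The allocation idea can be illustrated with a minimal sketch. It assumes a per-layer sensitivity score is already available and splits the global budget in proportion to it; the function name `allocate_layer_budgets`, the proportional rule, and the `min_keep` floor are illustrative assumptions, not the procedure from the paper.

```python
import numpy as np

def allocate_layer_budgets(sensitivity, seq_len, keep_ratio, min_keep=1):
    """Split a global KV budget across layers in proportion to sensitivity.

    sensitivity: one non-negative score per layer (hypothetical input; a
                 higher score means the layer is assumed to suffer more
                 from pruning).
    Returns integer per-layer token budgets that sum to the global budget.
    """
    s = np.asarray(sensitivity, dtype=np.float64)
    num_layers = s.size
    # Same total number of cached tokens as uniform pruning at keep_ratio.
    total = int(round(keep_ratio * seq_len * num_layers))
    if not (num_layers * min_keep <= total <= num_layers * seq_len):
        raise ValueError("global budget incompatible with per-layer bounds")

    budgets = np.full(num_layers, min_keep, dtype=np.int64)
    remaining = total - int(budgets.sum())
    open_layers = budgets < seq_len  # layers that can still receive tokens
    while remaining > 0:
        w = np.where(open_layers, s, 0.0)
        if w.sum() == 0:  # no signal left: spread uniformly over open layers
            w = open_layers.astype(np.float64)
        share = np.floor(w / w.sum() * remaining).astype(np.int64)
        if share.sum() == 0:  # leftover smaller than any share: give 1 token
            share[np.argmax(w)] = 1
        # A layer can never hold more than seq_len tokens.
        grant = np.minimum(share, seq_len - budgets)
        budgets += grant
        remaining -= int(grant.sum())
        open_layers = budgets < seq_len
    return budgets

# Four layers, 100 cached tokens each, keep 50% overall.
budgets = allocate_layer_budgets([1.0, 1.0, 2.0, 4.0], seq_len=100, keep_ratio=0.5)
uniform = allocate_layer_budgets([1.0, 1.0, 1.0, 1.0], seq_len=100, keep_ratio=0.5)
```

With equal scores the sketch reduces to uniform pruning, which makes the comparison "same global pruning ratio, different per-layer split" explicit.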

📝 Abstract

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
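As a rough illustration of attention-score-based pruning within a single layer, the sketch below keeps the cached tokens with the highest accumulated attention and discards the rest. The scoring rule (summing attention weights over recent queries), the function name `prune_kv`, and the array shapes are assumptions for illustration; the abstract only states that tokens with low attention scores are discarded.

```python
import numpy as np

def prune_kv(keys, values, attn_weights, budget):
    """Keep the `budget` cached tokens with the highest attention scores.

    keys, values:  arrays of shape (num_tokens, head_dim) for one layer/head.
    attn_weights:  array of shape (num_queries, num_tokens); each row holds
                   the attention one query paid to the cached tokens.
    budget:        number of tokens this layer may keep (assumed to come
                   from a per-layer allocation step).
    """
    # Assumed scoring rule: total attention each cached token received.
    scores = attn_weights.sum(axis=0)
    budget = max(0, min(int(budget), scores.shape[0]))
    # Highest-scoring tokens first; stable sort keeps ties deterministic.
    keep = np.argsort(-scores, kind="stable")[:budget]
    keep.sort()  # restore original token order in the pruned cache
    return keys[keep], values[keep], keep

keys = np.arange(12, dtype=np.float32).reshape(6, 2)
values = keys + 100.0
attn = np.array([
    [0.10, 0.40, 0.05, 0.30, 0.05, 0.10],
    [0.20, 0.10, 0.05, 0.50, 0.05, 0.10],
])
pruned_k, pruned_v, kept = prune_kv(keys, values, attn, budget=3)
```

Under a layer-dependent scheme, `budget` would differ from layer to layer while the sum over layers stays fixed, whereas uniform pruning passes the same value to every layer.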

Problem

Research questions and friction points this paper is trying to address.

KV cache pruning

long-context LLM inference

layer-dependent allocation

memory bottleneck

attention mechanism

Innovation

Methods, ideas, or system contributions that make the work stand out.

layer-dependent pruning

KV cache

long-context LLM