Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead and latency caused by unbounded KV cache growth during autoregressive decoding of large language models (LLMs), this paper proposes Lethe, a dynamic KV cache management framework. Methodologically, Lethe introduces the first dual-dimensional adaptive pruning mechanism—operating jointly across layers (spatial) and decoding steps (temporal)—complemented by a layer-aware sparse allocation strategy and multi-round token importance estimation. Its core algorithm, Recency-Aware Selective Retention (RASR), jointly models attention redundancy and semantic relevance to retain critical tokens, overcoming the limitations of conventional recency-only heuristics. Evaluated across diverse LLMs and tasks, Lethe achieves up to 2.56× higher throughput while substantially reducing memory consumption—without compromising generation quality.

📝 Abstract
Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56×.
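The RASR idea described above can be illustrated with a minimal sketch. This is a hypothetical scoring function, not the paper's actual formulation: it assumes each cached token carries an accumulated attention mass and an age, and blends a normalized recency term with a normalized relevance term (the `recency_weight` parameter and the linear blend are illustrative assumptions).

```python
import numpy as np

def rasr_retention(attn_scores, ages, budget, recency_weight=0.5):
    """Toy recency-aware selective retention (hypothetical scoring;
    the paper's exact RASR mechanism may differ).

    attn_scores: accumulated attention mass each cached token has received.
    ages: decoding steps elapsed since each token entered the cache.
    budget: number of tokens to keep.
    Returns indices of tokens to retain.
    """
    attn_scores = np.asarray(attn_scores, dtype=float)
    ages = np.asarray(ages, dtype=float)
    # Normalize both signals to [0, 1] so they are comparable.
    relevance = attn_scores / (attn_scores.max() + 1e-8)
    recency = 1.0 - ages / (ages.max() + 1e-8)
    # Blend recency with attention-derived relevance, then keep the top-k.
    score = recency_weight * recency + (1.0 - recency_weight) * relevance
    return np.argsort(score)[-budget:]
```

With this blend, an old but heavily attended token (e.g. a key premise in a reasoning chain) can outlive newer low-attention tokens, which is exactly what a pure recency heuristic cannot do.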
Problem

Research questions and friction points this paper is trying to address.

Reduces memory and latency from KV cache accumulation in long LLM decoding sequences
Addresses dynamic layer-sensitive inefficiencies in long-form generative reasoning tasks
Improves throughput via adaptive spatial and temporal KV cache pruning mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise sparsity-aware KV cache allocation per transformer layer
Multi-round token pruning using Recency-Aware Selective Retention
Dynamic KV cache management balancing efficiency and quality
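The layerwise allocation contribution can be sketched as follows. This is an illustrative assumption of how per-layer budgets might be derived from redundancy estimates; the paper's actual estimator of attention redundancy is not reproduced here.

```python
import numpy as np

def layer_budgets(redundancy, total_budget):
    """Toy sparsity-aware KV budget split across layers (illustrative only).

    redundancy: per-layer redundancy estimate in (0, 1]; higher means the
        layer's attention is more prunable.
    total_budget: total KV cache slots to distribute across all layers.
    Returns an integer budget per layer.
    """
    # Layers with less redundant attention receive a larger cache share.
    keep_weight = 1.0 - np.asarray(redundancy, dtype=float)
    shares = keep_weight / keep_weight.sum()
    return np.floor(shares * total_budget).astype(int)
```

The design intent mirrored here is that a uniform per-layer budget wastes capacity: layers whose attention is concentrated on few tokens can be pruned aggressively, freeing slots for layers with flatter attention.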
Hui Zeng
Xidian University, Xi’an, China
Daming Zhao
Tsinghua University, Beijing, China
Pengfei Yang
Institute of Software, Chinese Academy of Sciences
Probabilistic model checking · DNN verification
Wenxuan Hou
Xidian University, Xi’an, China
Tianyang Zheng
Xidian University, Xi’an, China
Hui Li
Xidian University, Xi’an, China
Weiye Ji
Xidian University, Xi’an, China
Jidong Zhai
Tsinghua University
Parallel Computing · Compiler · Programming Model · GPU