🤖 AI Summary
To address the high memory overhead and latency caused by unbounded KV cache growth during autoregressive decoding of large language models (LLMs), this paper proposes Lethe, a dynamic KV cache management framework. Methodologically, Lethe introduces the first dual-dimensional adaptive pruning mechanism—operating jointly across layers (spatial) and decoding steps (temporal)—complemented by a layer-aware sparse allocation strategy and multi-round token importance estimation. Its core algorithm, Recency-Aware Selective Retention (RASR), jointly models attention redundancy and semantic relevance to retain critical tokens, overcoming the limitations of conventional recency-only heuristics. Evaluated across diverse LLMs and tasks, Lethe achieves up to 2.56× higher throughput while substantially reducing memory consumption—without compromising generation quality.
📝 Abstract
Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56×.
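The two mechanisms described above can be illustrated with a minimal sketch. The abstract does not specify the exact allocation rule or relevance score, so the code below makes two labeled assumptions: layer budgets are split in proportion to one minus each layer's estimated attention redundancy, and RASR-style retention keeps a recency window plus the highest accumulated-attention tokens among older positions. Function names (`layer_budgets`, `rasr_retain`) are hypothetical.

```python
import numpy as np

def layer_budgets(redundancy, total_budget):
    """Hypothetical layerwise sparsity-aware allocation: layers whose
    attention is estimated as less redundant keep more cache tokens."""
    keep = 1.0 - np.asarray(redundancy, dtype=float)
    weights = keep / keep.sum()
    # round to integer budgets, keeping at least one token per layer
    return np.maximum(1, np.round(weights * total_budget).astype(int))

def rasr_retain(attn_scores, budget, recency_window):
    """Sketch of recency-aware selective retention: always keep the most
    recent `recency_window` tokens, then fill the remaining budget with
    the older tokens carrying the highest accumulated attention mass."""
    n = len(attn_scores)
    recent = set(range(max(0, n - recency_window), n))
    remaining = budget - len(recent)
    older = [i for i in range(n) if i not in recent]
    # rank older tokens by relevance (accumulated attention), keep top-k
    ranked = sorted(older, key=lambda i: attn_scores[i], reverse=True)
    kept = recent | set(ranked[:max(0, remaining)])
    return sorted(kept)

# Example: 6 cached tokens, budget of 4, recency window of 2.
# Tokens 4 and 5 are kept for recency; tokens 0 and 2 have the highest
# attention among the rest, so tokens 1 and 3 are evicted.
print(rasr_retain([0.9, 0.1, 0.8, 0.05, 0.2, 0.3], budget=4, recency_window=2))
```

A real implementation would recompute these scores at each pruning round as attention patterns evolve during decoding; this sketch only shows a single round over static scores.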