🤖 AI Summary
To address the high memory overhead and latency caused by unbounded KV cache growth during autoregressive decoding of large language models (LLMs), this paper proposes Lethe, a dynamic KV cache management framework. Methodologically, Lethe introduces the first dual-dimensional adaptive pruning mechanism—operating jointly across layers (spatial) and decoding steps (temporal)—complemented by a layer-aware sparse allocation strategy and multi-round token importance estimation. Its core algorithm, Recency-Aware Selective Retention (RASR), jointly models attention redundancy and semantic relevance to retain critical tokens, overcoming the limitations of conventional recency-only heuristics. Evaluated across diverse LLMs and tasks, Lethe achieves up to 2.56× higher throughput while substantially reducing memory consumption—without compromising generation quality.
📝 Abstract
Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56×.
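The two mechanisms described above can be illustrated with a minimal sketch. The abstract does not specify the exact allocation rule or relevance score, so the code below makes two labeled assumptions: layer budgets are split in proportion to one minus each layer's estimated attention redundancy, and RASR-style retention keeps a recency window plus the highest accumulated-attention tokens among older positions. Function names (`layer_budgets`, `rasr_retain`) are hypothetical.

```python
import numpy as np

def layer_budgets(redundancy, total_budget):
    """Hypothetical layerwise sparsity-aware allocation: layers whose
    attention is estimated as less redundant keep more cache tokens."""
    keep = 1.0 - np.asarray(redundancy, dtype=float)
    weights = keep / keep.sum()
    # round to integer budgets, keeping at least one token per layer
    return np.maximum(1, np.round(weights * total_budget).astype(int))

def rasr_retain(attn_scores, budget, recency_window):
    """Sketch of recency-aware selective retention: always keep the most
    recent `recency_window` tokens, then fill the remaining budget with
    the older tokens carrying the highest accumulated attention mass."""
    n = len(attn_scores)
    recent = set(range(max(0, n - recency_window), n))
    remaining = budget - len(recent)
    older = [i for i in range(n) if i not in recent]
    # rank older tokens by relevance (accumulated attention), keep top-k
    ranked = sorted(older, key=lambda i: attn_scores[i], reverse=True)
    kept = recent | set(ranked[:max(0, remaining)])
    return sorted(kept)

# Example: 6 cached tokens, budget of 4, recency window of 2.
# Tokens 4 and 5 are kept for recency; tokens 0 and 2 have the highest
# attention among the rest, so tokens 1 and 3 are evicted.
print(rasr_retain([0.9, 0.1, 0.8, 0.05, 0.2, 0.3], budget=4, recency_window=2))
```

A real implementation would recompute these scores at each pruning round as attention patterns evolve during decoding; this sketch only shows a single round over static scores.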