🤖 AI Summary
KV cache memory grows linearly with context length in long conversations, and existing compression methods either allow unbounded peak memory or sacrifice contextual coherence. To address this, the paper proposes EpiCache, a training-free framework built around episodic KV compression: dialogue history is organized into coherent "episodes" via block-wise prefilling and clustering-driven semantic segmentation, and segment-level, query-agnostic cache eviction is applied per episode, guided by layer-wise sensitivity analysis and adaptive memory budget allocation. Crucially, this avoids both the peak-memory surge of full-context prefilling and the multi-turn coherence degradation caused by query-dependent eviction strategies. Evaluated on three long-context conversational question-answering benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, maintains near-full-cache performance at 4–6× compression ratios, and reduces inference latency and memory footprint by up to 2.4× and 3.5×, respectively.
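As a concrete illustration of the segmentation step, here is a minimal Python sketch of clustering dialogue turns into episodes. It stands in TF-IDF features and k-means (via scikit-learn) for whatever embedding and clustering scheme the paper actually uses; the function name `segment_into_episodes` and the sample turns are hypothetical, not EpiCache's API.

```python
# Minimal sketch: clustering-driven episode segmentation (illustrative only).
# The paper clusters conversation history into coherent episodes; TF-IDF +
# k-means here stand in for the actual embedding model and clustering rule.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def segment_into_episodes(turns: list[str], n_episodes: int) -> list[int]:
    """Assign each dialogue turn an episode id via clustering.

    A real system would embed turns with the LLM itself and may enforce
    temporal contiguity; this sketch shows only the clustering step.
    """
    features = TfidfVectorizer().fit_transform(turns)  # turn -> sparse vector
    labels = KMeans(n_clusters=n_episodes, n_init="auto").fit_predict(features)
    return labels.tolist()


turns = [
    "Let's plan the trip to Kyoto in April.",
    "Book the hotel near the station.",
    "Now, about the quarterly report numbers...",
    "Revenue grew 12% but costs are up too.",
    "Back to the trip: what about the temple tour?",
]
print(segment_into_episodes(turns, n_episodes=2))  # e.g. [0, 0, 1, 1, 0]
```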
📝 Abstract
Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
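The abstract's core memory-bounding mechanism, prefilling in blocks and evicting after each block rather than after a full-context prefill, can be sketched as follows. The per-token importance scores and the function `blockwise_prefill` are illustrative stand-ins under the assumption of a generic importance heuristic (e.g., accumulated attention mass); EpiCache's actual eviction is episode-specific and query-agnostic, which this toy score does not model.

```python
# Minimal sketch: block-wise prefill with budget-bounded eviction.
# Peak cache size stays at budget + block instead of growing with the
# full context, which is the point of evicting after every block.
import numpy as np


def blockwise_prefill(token_scores: np.ndarray, block: int, budget: int) -> np.ndarray:
    """Prefill in blocks, evicting down to `budget` kept KV entries per block.

    `token_scores[i]` stands in for the importance of token i's KV entry.
    Returns the indices of the entries retained at the end.
    """
    kept = np.empty(0, dtype=int)
    for start in range(0, len(token_scores), block):
        new = np.arange(start, min(start + block, len(token_scores)))
        candidates = np.concatenate([kept, new])
        if len(candidates) > budget:
            order = np.argsort(token_scores[candidates])[::-1]  # high score first
            candidates = candidates[order[:budget]]
        kept = np.sort(candidates)
    return kept


scores = np.random.default_rng(0).random(1000)  # toy per-token importance
kept = blockwise_prefill(scores, block=128, budget=256)
print(len(kept), "entries kept; peak cache never exceeded", 256 + 128)
```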
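Likewise, the adaptive layer-wise budget allocation can be approximated by splitting a total KV budget across layers in proportion to each layer's measured eviction sensitivity. The proportional-with-floor rule below is one plausible scheme consistent with the abstract's description, not the paper's exact formula; `allocate_budget`, the floor, and the sensitivity values are all hypothetical.

```python
# Minimal sketch: sensitivity-proportional layer-wise budget allocation.
# Layers that degrade more under eviction receive a larger share of the
# fixed total budget; a floor keeps every layer minimally provisioned.
import numpy as np


def allocate_budget(sensitivity: np.ndarray, total: int, floor: int = 8) -> np.ndarray:
    """Split `total` KV slots across layers, proportional to sensitivity."""
    weights = sensitivity / sensitivity.sum()
    alloc = np.maximum(floor, np.floor(weights * total)).astype(int)
    # Hand leftover slots (lost to flooring) to the most sensitive layers.
    leftover = total - alloc.sum()
    for i in np.argsort(sensitivity)[::-1][: max(leftover, 0)]:
        alloc[i] += 1
    return alloc


sens = np.array([0.9, 0.4, 0.2, 0.7])    # toy per-layer eviction sensitivity
print(allocate_budget(sens, total=1024))  # per-layer budgets summing to 1024
```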