FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

📅 2025-11-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive GPU memory consumption and computational overhead caused by rapid KV cache growth with increasing context and generation length in large language model (LLM) inference, this paper proposes a hierarchical KV cache management mechanism grounded in a temporal-stability analysis of attention heads. The authors empirically observe that attention heads differ markedly in how stably they attend to salient tokens over time, which enables a dynamic classification into stable and unstable heads. Leveraging this insight, the method combines fine-grained top-K cache retention, CPU–GPU hierarchical storage, and periodic cache reranking. Integrated into the vLLM inference engine, the approach preserves model accuracy while reducing GPU memory usage by up to 70%, improving offline throughput by 1.38–1.55×, and lowering online latency by 1.6–2.1×.
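The stable/unstable distinction above can be made concrete with a small sketch: measure how much each head's top-K attended-token set overlaps across consecutive decode steps, and label heads whose overlap stays high as stable. This is a toy illustration under assumed inputs (a per-step attention trace), not FlexiCache's actual classifier; all names and the threshold are illustrative.

```python
import numpy as np

def top_k_set(scores, k):
    """Indices of the k highest-attention tokens for one head at one step."""
    return set(np.argpartition(scores, -k)[-k:])

def classify_heads(attn_history, k=4, threshold=0.8):
    """Label each KV head 'stable' or 'unstable' by the average overlap of
    its top-k token set between consecutive decode steps.

    attn_history: array of shape (steps, heads, tokens) holding per-step
    attention scores -- a hypothetical trace format, not FlexiCache's own.
    """
    steps, heads, _ = attn_history.shape
    labels = []
    for h in range(heads):
        overlaps = []
        prev = top_k_set(attn_history[0, h], k)
        for t in range(1, steps):
            cur = top_k_set(attn_history[t, h], k)
            overlaps.append(len(prev & cur) / k)  # fraction of top-k tokens retained
            prev = cur
        labels.append("stable" if np.mean(overlaps) >= threshold else "unstable")
    return labels
```

A head that keeps attending to the same critical tokens scores an overlap near 1 and is classified stable, so most of its cache can safely live off-GPU; a head whose top-k set churns is kept fully resident.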

๐Ÿ“ Abstract
Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.
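The placement policy described in the abstract (all pages on GPU for unstable heads; only the top-K pages on GPU for stable heads, with the rest offloaded to host memory) can be sketched as follows. This is a simplified model under assumed inputs (a per-page importance score), not FlexiCache's real page-table code; the function and parameter names are illustrative.

```python
import numpy as np

def place_pages(page_scores, label, k=2):
    """Decide KV-cache page placement for one head.

    page_scores: per-page importance, e.g. accumulated attention mass
                 (an assumed metric for this sketch);
    label: 'stable' or 'unstable' from the head classifier.
    Returns (gpu_pages, host_pages) as sorted page-index lists.
    """
    n = len(page_scores)
    if label == "unstable":
        # Unstable heads: keep every page resident in GPU memory.
        return list(range(n)), []
    # Stable heads: keep only the top-k pages on the GPU and
    # offload the remainder to host (CPU) memory.
    order = np.argsort(page_scores)[::-1]
    gpu = sorted(order[:k].tolist())
    host = sorted(order[k:].tolist())
    return gpu, host
```

The periodic reranking step then amounts to re-running this placement with fresh scores: any page newly promoted into the top-K is fetched back from host memory, which is cheap precisely because stable heads change their top pages rarely.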
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU memory usage in LLM serving
Managing KV cache efficiently without accuracy loss
Exploiting temporal stability of attention heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal stability of KV heads for cache management
Classifies KV heads as stable or unstable for hierarchical caching
Offloads stable head KV-cache pages to host memory