CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

πŸ“… 2025-03-16
πŸ“ˆ Citations: 2
✨ Influential: 1
πŸ“„ PDF
πŸ€– AI Summary
Uniform layer-wise KV cache eviction schemes ignore inter-layer variation in attention patterns and the shifting importance of tokens over time, leading to excessive memory use in long-context inference for large language models. Method: a preference-driven cascading adaptive cache eviction mechanism: (1) layer-preference-aware global cache allocation, framed as a "cake-slicing" resource-partitioning problem; (2) an eviction indicator that integrates attention dynamics in both spatial and temporal dimensions; and (3) cascading memory-constrained cache management compatible with FlashAttention-2. Results: on LongBench and NeedleBench, CAKE maintains full model performance with only 3.2% of the original KV cache, achieves over 10x decoding speedup for 128K-token contexts, and consistently outperforms state-of-the-art baselines under tight memory budgets.

πŸ“ Abstract
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.
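As a rough illustration of the "cake-slicing" idea, a global cache budget can be split across layers in proportion to per-layer preference scores. This is a hedged sketch only: the function name, the proportional split, and the remainder handling are assumptions, not the paper's exact allocation rule.

```python
import numpy as np

def allocate_cache_budget(pref_scores, total_budget, min_per_layer=1):
    """Divide a total KV cache budget across layers in proportion to
    per-layer preference scores (illustrative 'cake-slicing' sketch)."""
    prefs = np.asarray(pref_scores, dtype=float)
    shares = prefs / prefs.sum()
    # Floor to integers, but guarantee every layer keeps some cache.
    sizes = np.maximum(np.floor(shares * total_budget).astype(int),
                       min_per_layer)
    # Hand any leftover slots to the layers with the largest remainders.
    leftover = total_budget - sizes.sum()
    if leftover > 0:
        remainders = shares * total_budget - np.floor(shares * total_budget)
        for idx in np.argsort(-remainders)[:leftover]:
            sizes[idx] += 1
    return sizes
```

For example, a layer whose attention is spread broadly (higher preference score) would receive a proportionally larger slice of the shared budget than a layer attending to only a few positions.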
Problem

Research questions and friction points this paper is trying to address.

Existing KV cache eviction methods allocate resources uniformly across layers
Layers exhibit distinct attention patterns that uniform allocation ignores
Memory must be reduced without sacrificing long-context model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

CAKE frames KV cache eviction as adaptive "cake-slicing" of a global memory budget.
It allocates per-layer cache sizes from spatial and temporal attention preferences.
A new eviction indicator tracks the shifting importance of tokens over time.
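The temporal idea behind the eviction indicator can be sketched with an exponential moving average over per-token attention mass, so that a token's score reflects both recent and past attention. This is a minimal illustrative sketch under assumed names and decay scheme, not the paper's exact metric.

```python
def update_eviction_scores(scores, attn_weights, decay=0.9):
    """Blend previous scores with the latest per-token attention mass
    (illustrative temporally aware indicator, not CAKE's exact formula)."""
    return {tok: decay * scores.get(tok, 0.0) + (1 - decay) * w
            for tok, w in attn_weights.items()}

def evict(scores, keep):
    """Retain only the `keep` tokens with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])
```

Under this scheme a token that was heavily attended early but is now ignored gradually loses its score, so it becomes eligible for eviction even though a purely cumulative metric would keep it.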
πŸ”Ž Similar Papers
No similar papers found.