Learning to Evict from Key-Value Cache

πŸ“… 2026-02-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the substantial memory overhead of key-value (KV) caching in large language model inference, where existing heuristic eviction strategies struggle to accurately estimate the future utility of tokens and incur additional computational costs. The paper introduces the first reinforcement learning formulation of KV cache eviction, proposing KVP, a lightweight per-head policy network that learns to rank cached tokens by their future decoding utility directly from key-value vectors. Without modifying the underlying model architecture or increasing inference latency, KVP enables adaptive and efficient cache eviction. Evaluated under a unified framework with varying cache budgets, the method significantly outperforms existing approaches on the RULER and OASST2-4k benchmarks, while demonstrating strong zero-shot generalization and long-context capability on downstream tasks such as LongBench and BOOLQ.

πŸ“ Abstract
The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by a future-utility reward that evaluates ranking quality across all cache budgets, requiring no modifications to the underlying LLM and no additional inference passes. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
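To make the mechanism concrete, the following is a minimal sketch of the idea the abstract describes: a tiny per-head network scores each cached token from its key and value vectors, and the cache keeps only the top-scoring tokens under a given budget. All class names, layer sizes, and the untrained random weights here are illustrative assumptions, not the paper's actual KVP architecture or training procedure (which trains the policy with RL on generation traces).

```python
import numpy as np

rng = np.random.default_rng(0)

class PerHeadEvictionPolicy:
    """Hypothetical KVP-style scorer for a single attention head: a tiny
    two-layer MLP mapping a token's concatenated (key, value) vectors to a
    scalar utility score. Weights are random for illustration; the paper
    learns them with reinforcement learning."""

    def __init__(self, head_dim: int, hidden: int = 32):
        self.w1 = rng.standard_normal((2 * head_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 1)) * 0.1

    def score(self, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
        # keys, values: (seq_len, head_dim) -> scores: (seq_len,)
        x = np.concatenate([keys, values], axis=-1)
        h = np.maximum(x @ self.w1, 0.0)  # ReLU
        return (h @ self.w2).squeeze(-1)


def evict_to_budget(keys, values, policy, budget):
    """Keep only the `budget` highest-scoring cached tokens for one head,
    preserving their original sequence order."""
    scores = policy.score(keys, values)
    keep = np.sort(np.argsort(scores)[-budget:])  # top-`budget`, in order
    return keys[keep], values[keep]
```

Because ranking happens per head, each head can retain a different subset of tokens, which is what lets the policy adapt to head-specific attention patterns without touching the base model.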
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
Large Language Models
inference efficiency
memory management
token utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache eviction
reinforcement learning
large language models
adaptive caching
future utility prediction