Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing KV cache eviction methods rely on instantaneous heuristic metrics and overlook the heterogeneity among attention heads in their capacity for long-term semantic prediction, making it difficult to balance compression efficiency with accuracy. This work proposes LU-KV, a novel framework that, for the first time, uses long-term marginal utility as the criterion for cache allocation. It dynamically distributes per-head cache budgets through task-agnostic global combinatorial optimization, using a convex-hull relaxation and a marginal-utility-based greedy solver to reach near-optimal allocations. An offline, data-driven profiling step enables efficient deployment. Evaluated on the LongBench and RULER benchmarks, LU-KV achieves 80% KV cache compression with negligible performance degradation, substantially reducing inference latency and GPU memory consumption.

📝 Abstract
Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Extensive evaluations on LongBench and RULER benchmarks demonstrate that LU-KV achieves an 80% reduction in KV cache size with minimal performance degradation, while simultaneously reducing inference latency and GPU memory footprint.
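The allocation scheme described above can be illustrated with a small sketch. This is not the paper's implementation: the function name `greedy_allocate` and the toy utility curves are hypothetical. It assumes each head exposes a utility curve over possible budgets that has already been made concave by the convex-hull relaxation, so that repeatedly giving the next cache slot to the head with the largest marginal utility is the standard optimal greedy for such separable concave allocation problems.

```python
import heapq

def greedy_allocate(utility, total_budget):
    """Distribute `total_budget` cache slots across attention heads.

    utility[h][b] is the (relaxed, concave) utility of granting head h
    a budget of b slots; the marginal utility of the next slot is
    utility[h][b + 1] - utility[h][b]. Illustrative only.
    """
    n_heads = len(utility)
    alloc = [0] * n_heads
    # Max-heap (via negated gains) of each head's next marginal utility.
    heap = []
    for h in range(n_heads):
        if len(utility[h]) > 1:
            gain = utility[h][1] - utility[h][0]
            heapq.heappush(heap, (-gain, h))
    spent = 0
    while heap and spent < total_budget:
        neg_gain, h = heapq.heappop(heap)
        alloc[h] += 1
        spent += 1
        b = alloc[h]
        # Re-insert the head with the marginal utility of its next slot.
        if b + 1 < len(utility[h]):
            gain = utility[h][b + 1] - utility[h][b]
            heapq.heappush(heap, (-gain, h))
    return alloc

# Head 0 has a steep utility curve, head 1 a flat one, so head 0
# receives most of a 3-slot budget.
curves = [[0, 5, 8, 9], [0, 2, 3, 3.5]]
print(greedy_allocate(curves, 3))  # → [2, 1]
```

Under concavity each head's marginal gains are non-increasing, so the first time a head's next slot is no longer the global best, no later slot of that head can be either; this is what lets the heap-based greedy match the optimum of the relaxed combinatorial problem.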
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
attention-head heterogeneity
long-horizon utility
marginal utility
combinatorial optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache eviction
marginal utility
combinatorial optimization
attention heads
long-horizon utility