CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

📅 2026-05-23
🤖 AI Summary
This work addresses the memory overhead of the KV cache and the growing per-token attention cost in long-horizon large language model inference. The authors propose CONF-KV, a confidence-aware KV cache manager that sets the per-step cache budget from the model's next-token confidence: it retains more context when the model is uncertain and prunes aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, and a protected recent window preserves local coherence. The policy is combined with a pyramidal per-layer budget variant, mixed-precision (FP16/INT8) cache storage, and blockwise online-softmax attention. Across four model families at generated lengths up to 4K, memory usage stays close to that of a fixed 512-token sliding window while perplexity remains within 1.5–2.1 points of full KV. On Needle-in-a-Haystack up to 32K tokens, the method reaches 91.4% retrieval accuracy; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8× lower peak memory.
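To make the mixed-precision storage idea concrete, here is a minimal sketch of symmetric INT8 quantization for cached vectors. The summary does not say which tokens are held at which precision, so the split used here (most recent vectors kept at full precision, older ones quantized) is an assumption, as are the helper names `quantize_int8`, `dequantize_int8`, and `store_mixed`. Python floats stand in for FP16.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: returns (codes in [-127, 127], scale)."""
    peak = max(abs(x) for x in vec)
    scale = peak / 127.0 if peak > 0.0 else 1.0  # avoid dividing by zero
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original vector."""
    return [c * scale for c in codes]


def store_mixed(cache, recent):
    """Assumed policy (not from the paper): keep the last `recent` vectors
    unquantized and store all older vectors as (codes, scale) pairs."""
    split = max(0, len(cache) - recent)
    old = [quantize_int8(v) for v in cache[:split]]
    return old, cache[split:]
```

With round-to-nearest, the reconstruction error per element is at most half the scale, which is why a per-vector scale is a common choice when vectors differ a lot in magnitude.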
📝 Abstract
Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5–2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
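The policy described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the confidence formula, the confidence-to-budget mapping, or the ranking weights, so the normalized-entropy confidence, the linear budget interpolation, and the `alpha` mixing weight below are all hypothetical choices, as are the function names.

```python
import math


def confidence(probs):
    """Assumed confidence score in [0, 1]: one minus normalized entropy."""
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def step_budget(conf, min_budget, max_budget):
    """Assumed linear mapping: low confidence gives a larger cache budget."""
    conf = min(1.0, max(0.0, conf))
    return round(max_budget - conf * (max_budget - min_budget))


def select_tokens(attn_mass, budget, window, alpha=0.7):
    """Return sorted indices of cached tokens to keep.

    attn_mass[i] is the accumulated attention mass of token i (oldest first).
    The last `window` tokens are always kept; remaining slots go to the
    highest-scoring older tokens. `alpha` (hypothetical) weights attention
    mass against recency.
    """
    n = len(attn_mass)
    if n <= budget:
        return list(range(n))
    window = min(window, budget)  # the protected window cannot exceed the budget
    protected = list(range(n - window, n))
    slots = budget - window
    total = sum(attn_mass) or 1.0

    def score(i):
        recency = (i + 1) / n  # newer tokens score higher
        return alpha * attn_mass[i] / total + (1 - alpha) * recency

    kept = sorted(range(n - window), key=score, reverse=True)[:slots]
    return sorted(kept + protected)
```

For example, with `attn_mass = [5.0, 0.1, 0.1, 3.0, 0.1, 0.1, 0.2, 0.3]`, a budget of 4, and a protected window of 2, the two heavily attended old tokens (indices 0 and 3) survive alongside the two most recent ones.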
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
long-horizon LLM inference
memory efficiency
attention mechanism
confidence-aware caching
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence-aware eviction
mixed-precision KV cache
dynamic cache budgeting
long-context LLM inference
online softmax attention
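
As an illustration of the last item, the sketch below computes single-query attention over a cache one block at a time using the standard online-softmax recurrence: a running maximum, a running denominator, and a running unnormalized output that is rescaled whenever the maximum changes. The function names and the block size are illustrative, and the pure-Python lists stand in for GPU tensors; this shows the general technique rather than the paper's kernel.

```python
import math


def blockwise_attention(q, keys, values, block=4):
    """Single-query attention over non-empty keys/values, block by block."""
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                    # running maximum of the scores
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running unnormalized output
    for start in range(0, len(keys), block):
        ks = keys[start:start + block]
        vs = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new)   # rescales earlier sums; 0.0 on the first block
        denom *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - m_new)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]


def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
```

Both functions return the same result up to floating-point rounding; the blockwise version only ever holds one block of scores at a time, which is what lets it work over a cache whose entries are stored in separate chunks.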