A Simple Plug-in for Improving Eviction-Based KV Cache Compression

📅 2026-05-22

🤖 AI Summary
This work addresses the memory bottleneck of key-value (KV) caching in large language models during long-context inference. Existing methods mostly rely on binary eviction or representation approximation, so they may underuse tokens that need not be retained exactly but can still be reconstructed. The authors propose VECTOR, a plug-and-play augmentation for eviction-based pipelines that routes each token one of three ways (retain, approximate, or evict) by combining the base scorer's importance signal with a reconstructability signal from offline-calibrated, regression-based value estimation. This recovers value information that binary eviction would irreversibly discard, while key vectors are preserved to keep attention routing stable. Experiments show that VECTOR improves the quality-memory trade-off under medium-to-high compression, with the clearest gains under stricter memory budgets.
📝 Abstract
KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical for exact retention but are still reconstructable. We present VECTOR, a plug-and-play augmentation for eviction-based pipelines that introduces three-way token routing: retention, approximation, and eviction. VECTOR combines an importance signal from the base scorer with a reconstructability signal from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers useful value information that would otherwise be irreversibly lost under binary eviction, while preserving key vectors for attention routing stability. Experimental results show that VECTOR improves quality-memory trade-offs under medium-to-high compression, with especially clear gains in stricter budget regimes.
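The three-way routing described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual decision rule, so everything below is an assumption made for illustration: the function names `route_tokens` and `memory_units`, the two budgets, the use of `importance * reconstructability` to rank non-retained tokens, and the cost model in which an approximated token stores only its key while its value is estimated on demand.

```python
from typing import Dict, List, Sequence


def route_tokens(
    importance: Sequence[float],
    reconstructability: Sequence[float],
    keep_budget: int,
    approx_budget: int,
) -> Dict[str, List[int]]:
    """Assign every token index to 'retain', 'approximate', or 'evict'.

    Hypothetical rule (not taken from the paper):
      1. Retain the `keep_budget` tokens with the highest importance.
      2. Among the rest, approximate the `approx_budget` tokens with the
         highest importance * reconstructability product.
      3. Evict everything else.
    """
    if len(importance) != len(reconstructability):
        raise ValueError("score lists must have the same length")
    if keep_budget < 0 or approx_budget < 0:
        raise ValueError("budgets must be non-negative")

    n = len(importance)
    # Sort by descending importance; ties go to the earlier token.
    by_importance = sorted(range(n), key=lambda i: (-importance[i], i))
    retain = by_importance[:keep_budget]
    rest = by_importance[keep_budget:]

    # Rank the remaining tokens by the combined (assumed) score.
    by_combined = sorted(
        rest, key=lambda i: (-importance[i] * reconstructability[i], i)
    )
    approximate = by_combined[:approx_budget]
    evict = by_combined[approx_budget:]

    return {
        "retain": sorted(retain),
        "approximate": sorted(approximate),
        "evict": sorted(evict),
    }


def memory_units(routes: Dict[str, List[int]]) -> int:
    """Count stored vectors under an assumed cost model.

    Retained tokens store key and value (2 units); approximated tokens
    store the key only (1 unit); evicted tokens store nothing. The cost
    of the value estimator itself is ignored here.
    """
    return 2 * len(routes["retain"]) + len(routes["approximate"])


if __name__ == "__main__":
    importance = [0.9, 0.1, 0.5, 0.2, 0.05, 0.7]
    reconstructability = [0.3, 0.8, 0.4, 0.9, 0.2, 0.6]
    routes = route_tokens(importance, reconstructability, 2, 2)
    print(routes)
    print("stored vectors:", memory_units(routes), "of", 2 * len(importance))
```

With these toy scores, a binary policy with the same retain budget would drop four tokens outright, whereas the three-way policy keeps keys for two of them so their values can still be estimated. How the real scorer and estimator set these boundaries is not specified in the abstract.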
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
long-context inference
token eviction
reconstructability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
token eviction
reconstructability
plug-and-play
long-context inference