🤖 AI Summary
In long-sequence inference, the KV cache grows linearly with sequence length, inflating memory use and slowing decoding. Existing eviction methods estimate KV importance from a local window of queries and can overlook global semantics, discarding critical information. This paper proposes Judge Q, a lightweight training method that appends a list of soft tokens to the input sequence and fine-tunes only the embedding layer, so that these tokens produce globally aware query vectors in place of fixed local-window queries. An attention-map alignment objective trains the soft tokens' attention over the input to match that of actual decoded tokens, yielding importance scores well suited to KV cache eviction. On Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, under identical eviction budgets, Judge Q improves LongBench by roughly 1 point and RULER by more than 3 points, substantially reducing the performance degradation caused by eviction. The method integrates into mainstream open-source LLMs with minimal training overhead.
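The attention-map alignment described above can be illustrated with a divergence-style objective between the soft tokens' attention distribution and that of the decoded tokens. The sketch below assumes a KL-divergence loss form; the `alignment_loss` name and the exact objective are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of an attention-map alignment loss: the paper trains
# soft tokens so their attention over the input matches that of decoded
# tokens; here we assume a KL-divergence objective for illustration.
import numpy as np

def alignment_loss(soft_attn, decoded_attn, eps=1e-9):
    """KL(decoded || soft), averaged over rows.

    soft_attn, decoded_attn : (n_tokens, seq_len) attention distributions
    (each row sums to 1). `eps` guards against log(0).
    """
    kl = decoded_attn * (np.log(decoded_attn + eps) - np.log(soft_attn + eps))
    return float(np.mean(np.sum(kl, axis=-1)))

# Identical distributions give zero loss; mismatched ones give a positive loss.
p = np.array([[0.1, 0.2, 0.7]])
q = np.array([[0.7, 0.2, 0.1]])
print(alignment_loss(p, p))  # 0.0
print(alignment_loss(p, q) > 0)  # True
```

In practice this loss would backpropagate only into the embedding rows of the soft tokens, which is what keeps the training cost low.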
📄 Abstract
Large language models (LLMs) use a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with sequence length, which seriously affects memory usage and decoding efficiency. Current KV cache eviction methods typically use the last window of the pre-filling phase as queries to compute KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially neglecting or omitting crucial global information. To mitigate this issue, we propose Judge Q, a novel training method that incorporates a soft token list. This method tunes only the model's embedding layer, at low training cost. By concatenating the soft token list to the end of the input sequence, we train these tokens' attention map over the original input to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values in the KV cache, thus maintaining decoding quality when the KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation than existing eviction approaches. We validate our approach through experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results show an improvement of approximately 1 point on LongBench and over 3 points on RULER. The proposed method can be seamlessly integrated into existing open-source models with minimal training overhead, enhancing performance in KV cache eviction scenarios.
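The eviction step described in the abstract can be sketched as follows: the trained soft tokens supply query vectors, their attention over the cached sequence gives a per-position importance score, and only the top-scoring KV pairs are kept. All shapes, names, and the score aggregation below are illustrative assumptions, not the paper's released implementation.

```python
# Minimal single-head sketch of soft-token-guided KV cache eviction.
# Assumed: soft-token queries are already trained; scores are the soft
# tokens' summed attention over cached positions (an illustrative choice).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def evict_kv(keys, values, soft_queries, budget):
    """Keep the `budget` KV pairs rated most important by the soft-token queries.

    keys, values : (seq_len, d)  cached keys/values for one head
    soft_queries : (n_soft, d)   query vectors from the trained soft tokens
    budget       : int           number of KV pairs to retain
    """
    d = keys.shape[-1]
    # Attention of each soft token over the cached sequence.
    attn = softmax(soft_queries @ keys.T / np.sqrt(d), axis=-1)  # (n_soft, seq_len)
    # Aggregate across soft tokens into one importance score per position.
    scores = attn.sum(axis=0)                                    # (seq_len,)
    # Retain the top-`budget` positions, preserving their original order.
    keep = np.sort(np.argsort(scores)[-budget:])
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
Q_soft = rng.standard_normal((4, 8))
K_kept, V_kept = evict_kv(K, V, Q_soft, budget=6)
print(K_kept.shape)  # (6, 8)
```

Because the soft tokens attend over the entire input rather than a trailing window, their scores reflect global context, which is the intuition behind the reported gains under a fixed eviction budget.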