Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

📅 2025-09-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Linear KV cache growth during long-sequence inference causes high memory overhead and low decoding efficiency, and existing eviction methods that estimate importance from the last local window neglect global semantics, risking the loss of critical information. This paper proposes Judge Q, a lightweight training method that appends a trainable soft token list to the input sequence and fine-tunes only the model's embedding layer, so the soft tokens' queries become globally aware replacements for fixed local-window queries. Through attention map alignment, the soft tokens' attention distribution over the input is trained to match that of the actual decoded tokens, enabling better-informed KV cache eviction. Experiments on Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3 show that, under identical eviction budgets, Judge Q improves LongBench by roughly 1 point and RULER by over 3 points, significantly mitigating performance degradation. The method integrates seamlessly with mainstream open-source LLMs and incurs minimal training overhead.

๐Ÿ“ Abstract
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with sequence length, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically use the last window from the pre-filling phase as queries to compute KV importance scores for eviction. Although this scheme is simple to implement, it tends to focus overly on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method that incorporates a soft token list. This method tunes only the model's embedding layer at a low training cost. By concatenating the soft token list to the end of the input sequence, we train these tokens' attention map over the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when the KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation than existing eviction approaches. We validate our approach through experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on LongBench and over 3 points on RULER. The proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
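The eviction step the abstract describes — scoring cached keys and values by the attention they receive from the soft-token queries and retaining only the top entries under the budget — can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation; the function name, the mean aggregation over soft-token queries, and the flat (non-batched) layout are assumptions.

```python
import numpy as np

def evict_kv(keys, values, soft_queries, budget):
    """Keep the `budget` most important KV entries, scored by soft-token attention.

    keys, values: (seq_len, d) cached tensors for one head (illustrative layout).
    soft_queries: (m, d) query vectors derived from the trained soft tokens.
    Hypothetical sketch of the eviction idea, not the paper's exact procedure.
    """
    d = keys.shape[-1]
    # Scaled dot-product attention logits of each soft-token query over the cache.
    logits = soft_queries @ keys.T / np.sqrt(d)                  # (m, seq_len)
    # Row-wise softmax (numerically stabilized).
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Aggregate importance across soft tokens; mean is one simple choice.
    importance = attn.mean(axis=0)                               # (seq_len,)
    # Indices of the retained entries, kept in original sequence order.
    keep = np.sort(np.argsort(importance)[-budget:])
    return keys[keep], values[keep], keep
```

In a real multi-head cache the same scoring would run per head (or per group, for grouped-query attention), and the retained indices would update the cache in place.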
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache eviction to reduce memory usage
Addressing neglect of global information in cache management
Improving decoding efficiency while maintaining performance quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trainable queries optimize KV cache eviction
Soft token list captures global information effectively
Low-cost embedding layer tuning enhances decoding quality
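The attention-map alignment behind these contributions trains the soft tokens' attention over the input to match that of the actual decoded tokens. The paper's exact loss is not given here; a KL divergence between the two attention distributions is one natural choice, sketched below (the function name and the KL form are assumptions).

```python
import numpy as np

def attention_alignment_loss(soft_attn, decoded_attn, eps=1e-9):
    """KL(decoded || soft), averaged over query positions.

    soft_attn, decoded_attn: (num_tokens, seq_len) arrays, each row a
    softmax attention distribution over the original input sequence.
    The KL form is an assumption; the abstract only states that the
    soft tokens' attention map is trained to align with that of the
    actual decoded tokens.
    """
    p = decoded_attn + eps  # target: attention of actual decoded tokens
    q = soft_attn + eps     # prediction: attention of the soft tokens
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

Since only the embedding layer is tuned, gradients from this loss flow solely into the soft tokens' embeddings, which keeps the training cost low.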
🔎 Similar Papers
No similar papers found.
Yijun Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Yixuan Wang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Yuzhuang Xu
Tsinghua University
Natural Language Processing · Efficient AI · Machine Learning
Shiyu Ji
University of California, Santa Barbara
Information Retrieval · Privacy · Security
Yang Xu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Qingfu Zhu
Harbin Institute of Technology
NLP · Code LLM
Wanxiang Che
Professor, Harbin Institute of Technology
Natural Language Processing