🤖 AI Summary
In long-sequence inference, the KV cache grows linearly with sequence length, inflating memory use and slowing decoding. Existing eviction methods estimate KV importance from a local window of queries and can overlook global semantics, discarding critical information. This paper proposes Judge Q, a lightweight training method that appends a list of soft tokens to the input sequence and fine-tunes only the embedding layer, so that these tokens produce globally aware query vectors in place of fixed local-window queries. An attention-map alignment objective trains the soft tokens' attention over the input to match that of actual decoded tokens, yielding importance scores well suited to KV cache eviction. On Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, under identical eviction budgets, Judge Q improves LongBench by roughly 1 point and RULER by more than 3 points, substantially reducing the performance degradation caused by eviction. The method integrates into mainstream open-source LLMs with minimal training overhead.
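The attention-map alignment described above can be illustrated with a divergence-style objective between the soft tokens' attention distribution and that of the decoded tokens. The sketch below assumes a KL-divergence loss form; the `alignment_loss` name and the exact objective are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of an attention-map alignment loss: the paper trains
# soft tokens so their attention over the input matches that of decoded
# tokens; here we assume a KL-divergence objective for illustration.
import numpy as np

def alignment_loss(soft_attn, decoded_attn, eps=1e-9):
    """KL(decoded || soft), averaged over rows.

    soft_attn, decoded_attn : (n_tokens, seq_len) attention distributions
    (each row sums to 1). `eps` guards against log(0).
    """
    kl = decoded_attn * (np.log(decoded_attn + eps) - np.log(soft_attn + eps))
    return float(np.mean(np.sum(kl, axis=-1)))

# Identical distributions give zero loss; mismatched ones give a positive loss.
p = np.array([[0.1, 0.2, 0.7]])
q = np.array([[0.7, 0.2, 0.1]])
print(alignment_loss(p, p))  # 0.0
print(alignment_loss(p, q) > 0)  # True
```

In practice this loss would backpropagate only into the embedding rows of the soft tokens, which is what keeps the training cost low.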
📄 Abstract
Large language models (LLMs) use a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with sequence length, which seriously affects memory usage and decoding efficiency. Current KV cache eviction methods typically use the last window of the pre-filling phase as queries to compute KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially neglecting or omitting crucial global information. To mitigate this issue, we propose Judge Q, a novel training method that incorporates a soft token list. This method tunes only the model's embedding layer, at low training cost. By concatenating the soft token list to the end of the input sequence, we train these tokens' attention map over the original input to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values in the KV cache, thus maintaining decoding quality when the KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation than existing eviction approaches. We validate our approach through experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results show an improvement of approximately 1 point on LongBench and over 3 points on RULER. The proposed method can be seamlessly integrated into existing open-source models with minimal training overhead, enhancing performance in KV cache eviction scenarios.
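The eviction step described in the abstract can be sketched as follows: the trained soft tokens supply query vectors, their attention over the cached sequence gives a per-position importance score, and only the top-scoring KV pairs are kept. All shapes, names, and the score aggregation below are illustrative assumptions, not the paper's released implementation.

```python
# Minimal single-head sketch of soft-token-guided KV cache eviction.
# Assumed: soft-token queries are already trained; scores are the soft
# tokens' summed attention over cached positions (an illustrative choice).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def evict_kv(keys, values, soft_queries, budget):
    """Keep the `budget` KV pairs rated most important by the soft-token queries.

    keys, values : (seq_len, d)  cached keys/values for one head
    soft_queries : (n_soft, d)   query vectors from the trained soft tokens
    budget       : int           number of KV pairs to retain
    """
    d = keys.shape[-1]
    # Attention of each soft token over the cached sequence.
    attn = softmax(soft_queries @ keys.T / np.sqrt(d), axis=-1)  # (n_soft, seq_len)
    # Aggregate across soft tokens into one importance score per position.
    scores = attn.sum(axis=0)                                    # (seq_len,)
    # Retain the top-`budget` positions, preserving their original order.
    keep = np.sort(np.argsort(scores)[-budget:])
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
Q_soft = rng.standard_normal((4, 8))
K_kept, V_kept = evict_kv(K, V, Q_soft, budget=6)
print(K_kept.shape)  # (6, 8)
```

Because the soft tokens attend over the entire input rather than a trailing window, their scores reflect global context, which is the intuition behind the reported gains under a fixed eviction budget.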