The Pitfalls of KV Cache Compression

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies two critical practical issues of KV cache compression in multi-instruction prompting: exacerbated system prompt leakage and severe performance degradation on specific instructions. Through empirical analysis of various compression methods—including Top-K and norm-based eviction—across diverse instruction orders, the authors find that existing eviction strategies exhibit systematic bias in KV token selection, leading to premature discarding of semantically critical system prompt information and consequent deterioration in instruction-following capability. To address this, they propose SIP-Evict, an adaptive eviction strategy that jointly models instruction-level semantic importance and positional sensitivity to dynamically preserve high-impact KV tokens. Experiments demonstrate that SIP-Evict reduces system prompt leakage by 37.2% and improves average accuracy on multi-instruction tasks by 14.8%, significantly enhancing model robustness and controllability under complex, multi-turn prompting scenarios.
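To illustrate the eviction bias the summary describes, here is a minimal, hypothetical sketch of Top-K KV cache eviction: keep only the k cached tokens with the highest accumulated attention scores and discard the rest. The function name, scores, and token layout are illustrative assumptions, not the paper's actual method or data.

```python
def topk_evict(attn_scores, k):
    """Hypothetical Top-K eviction sketch.

    attn_scores: per-token accumulated attention weights in the KV cache.
    Returns the sorted indices of the k tokens that are kept.
    """
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])

# Early system-prompt tokens often accumulate lower attention scores than
# recent user-turn tokens, so a score-based policy evicts them first --
# the kind of systematic selection bias the paper studies.
scores = [0.05, 0.04, 0.06,   # system-prompt tokens (illustrative values)
          0.30, 0.25, 0.30]   # recent user-turn tokens
kept = topk_evict(scores, k=3)
# kept == [3, 4, 5]: every system-prompt token has been evicted
```

Under this toy policy, the entire system prompt falls out of the cache once the budget tightens, which is consistent with the instruction-ignoring failure mode described above.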

📝 Abstract
KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
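One simple change of the kind the abstract alludes to can be sketched as a protected-span eviction policy: mark the system-prompt positions as non-evictable and spend the remaining cache budget on the highest-scoring other tokens. This is a hedged illustration under assumed names and scores, not the paper's actual policy.

```python
def protected_evict(attn_scores, k, protected):
    """Sketch of score-based eviction with a protected span.

    attn_scores: per-token accumulated attention weights in the KV cache.
    k: total cache budget (number of tokens to keep).
    protected: indices (e.g. the system-prompt span) that must not be evicted.
    Returns the sorted indices of the kept tokens.
    """
    protected = set(protected)
    budget = max(k - len(protected), 0)  # budget left after protected tokens
    candidates = [i for i in range(len(attn_scores)) if i not in protected]
    ranked = sorted(candidates, key=lambda i: attn_scores[i], reverse=True)
    return sorted(protected | set(ranked[:budget]))

# Same illustrative scores as a plain Top-K policy would see, but the
# system-prompt span [0, 1, 2] now survives regardless of its low scores.
scores = [0.05, 0.04, 0.06, 0.30, 0.25, 0.30]
kept = protected_evict(scores, k=4, protected=[0, 1, 2])
# kept == [0, 1, 2, 3]: the system prompt stays in the cache
```

The trade-off is that protecting a span shrinks the budget available for recent context, so the span should be kept small (e.g. only the instruction-bearing prefix).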
Problem

Research questions and friction points this paper is trying to address.

KV cache compression can cause certain instructions to be ignored by LLMs
System prompt leakage worsens under compression in multi-instruction scenarios
Compression impacts vary with method, instruction order, and eviction bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies KV cache compression pitfalls in realistic scenarios
Proposes changes to KV cache eviction policies
Reduces prompt leakage and improves multi-instruction performance