CaliDrop: KV Cache Compression with Calibration

πŸ“… 2025-07-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the inference bottleneck in large language models (LLMs) caused by linear growth of KV cache memory with sequence length during long-context generation, this paper proposes CaliDropβ€”a dynamic calibration-based token eviction method. Its core innovation lies in performing speculative calibration prior to discarding less important KV entries, leveraging query vector similarity across adjacent positions to mitigate accuracy degradation. By jointly modeling attention sparsity and query similarity, CaliDrop enables fine-grained, adaptive KV selection. Experiments demonstrate that CaliDrop significantly improves generation accuracy under high compression ratios for diverse eviction strategies (e.g., Sink, StreamingLLM). On models such as Llama-3-8B, it retains over 98% of original performance while reducing KV cache memory consumption by 40%, effectively balancing inference efficiency and output quality.
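The summary's key observation, that query vectors at nearby decoding positions are highly similar, can be illustrated with a small sketch. This is an illustrative toy, not the paper's code: the head dimension, the drift magnitude, and the use of cosine similarity are all assumptions made here for demonstration.

```python
import numpy as np

# Hedged sketch of the "nearby queries are similar" observation.
# The dimension (64) and the small-perturbation query model are
# illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
d = 64                          # attention head dimension (assumed)
base = rng.normal(size=d)

# Simulate a slowly drifting query stream: each decoding step's query
# is a small perturbation of the previous one.
queries = [base]
for _ in range(7):
    queries.append(queries[-1] + 0.05 * rng.normal(size=d))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Adjacent queries stay close in direction, which is what lets a recent
# query stand in for future ones when re-scoring evicted tokens.
sims = [cosine(queries[i], queries[i + 1]) for i in range(len(queries) - 1)]
print(min(sims))  # remains close to 1.0 for small perturbations
```

Under this model, the current query is a cheap proxy for queries a few steps ahead, which is the property speculative calibration relies on.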

πŸ“ Abstract
Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose **CaliDrop**, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
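The token-eviction baseline that CaliDrop builds on can be sketched as follows. This is a minimal illustration, not the paper's method: the scoring rule (attention mass under the latest query) and the fixed cache budget are assumptions made here for demonstration.

```python
import numpy as np

# Hedged sketch of score-based KV cache token eviction. Names, sizes,
# and the single-query scoring rule are illustrative assumptions.
rng = np.random.default_rng(1)
seq_len, d, budget = 16, 32, 8   # cached tokens, head dim, cache budget

keys = rng.normal(size=(seq_len, d))
values = rng.normal(size=(seq_len, d))
query = rng.normal(size=d)       # the most recent query vector

# Score each cached token by its attention weight under the current query
# (numerically stable softmax over scaled dot products).
logits = keys @ query / np.sqrt(d)
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Evict all but the `budget` highest-scoring tokens, keeping positions sorted.
keep = np.sort(np.argsort(weights)[-budget:])
compressed_keys, compressed_values = keys[keep], values[keep]
print(compressed_keys.shape)  # (8, 32)
```

The accuracy loss the abstract describes arises because the evicted tokens are gone for good; CaliDrop's calibration step, per the abstract, intervenes on the discarded entries before this loss becomes irreversible.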
Problem

Research questions and friction points this paper is trying to address.

KV cache memory bottleneck in LLMs
Accuracy loss from token eviction compression
Improving token eviction via calibration (CaliDrop)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances token eviction via calibration
Uses speculative calibration on discarded tokens
Improves accuracy in KV cache compression
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yi Su, School of Computer Science and Technology, Soochow University
Quantong Qiu, Soochow University
Yuechi Zhou, School of Computer Science and Technology, Soochow University
Juntao Li, Soochow University
Qingrong Xia, Soochow University
Ping Li, Huawei Cloud
Xinyu Duan, Huawei Cloud
Zhefeng Wang, Huawei Cloud
Min Zhang, School of Computer Science and Technology, Soochow University