LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

📅 2025-04-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the deployment cost–accuracy trade-off caused by KV cache growth in long-context inference with large language models, this paper proposes a lightweight, attention-free KV cache compression method. The core innovation is an importance criterion based on lag relationships between key-value pairs: cache importance is estimated in a gradient-free, non-intrusive way by directly comparing KV statistics against those of a lagged window, exploiting the model's autoregressive locality. Crucially, the method requires no modifications to the inference framework and incurs negligible computational overhead. On LongBench and PasskeyRetrieval benchmarks, 2× compression achieves near-lossless performance, while 8× compression retains approximately 90% of original accuracy. In the 64-digit passkey retrieval task, it outperforms H₂O by over 60% in accuracy.

๐Ÿ“ Abstract
The increasing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged attention weights to evict non-critical cache tokens. But these methods come with a trade-off: they usually require major modification of the inference infrastructure and significant computational overhead. Based on the fact that Large Language Models are autoregressive, we propose *LagKV*, a KV allocation strategy relying only on straightforward comparisons among the KV themselves. It is a fully attention-free method that offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on LongBench and PasskeyRetrieval show that our approach achieves nearly zero loss at a 2× compression ratio and retains ≈90% of the original model performance at 8×. In the 64-digit passkey retrieval task in particular, our method outperforms the attention-weight-based method H₂O by over 60% at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
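The abstract describes the idea only at a high level (scoring KV entries by comparing them against a lagged portion of the cache, with no attention weights). A minimal sketch of one plausible reading is shown below; the chunking scheme, the min/max normalization against the next ("lag") chunk, the dispersion-based score, and all function names are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def lag_relative_scores(kv: np.ndarray, chunk_size: int) -> np.ndarray:
    """Score tokens in each chunk using min/max statistics taken from the
    *next* (lagged) chunk, so no attention weights are needed.

    kv: array of shape (seq_len, head_dim) -- keys or values of one head.
    Returns one score per token for every chunk that has a lagged successor.
    """
    n_chunks = kv.shape[0] // chunk_size
    scores = []
    for i in range(n_chunks - 1):  # the last chunk has no lag reference
        cur = kv[i * chunk_size:(i + 1) * chunk_size]
        lag = kv[(i + 1) * chunk_size:(i + 2) * chunk_size]
        lo, hi = lag.min(axis=0), lag.max(axis=0)    # per-channel range of the lag chunk
        norm = (cur - lo) / (hi - lo + 1e-8)         # lag-relative normalization
        scores.append(norm.std(axis=1))              # channel dispersion as importance
    return np.concatenate(scores)

def compress(kv: np.ndarray, chunk_size: int, keep_ratio: float) -> np.ndarray:
    """Evict the lowest-scoring tokens from the scorable prefix;
    the final (unscored) chunk is kept untouched."""
    s = lag_relative_scores(kv, chunk_size)
    keep = max(1, int(len(s) * keep_ratio))
    idx = np.sort(np.argsort(s)[-keep:])  # keep top-scoring tokens, preserve order
    tail = kv[len(s):]
    return np.concatenate([kv[idx], tail])
```

Because the scores are computed purely from KV tensor statistics, this kind of eviction can run outside the attention kernel, which is what makes integration into existing inference stacks straightforward.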
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache size in LLMs for cost-accuracy balance
Avoiding attention-weight trade-offs in KV cache compression
Preserving long-range retrieval ability (e.g., passkey retrieval) under aggressive compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

LagKV uses KV cache comparison for token importance
Attention-free method for easy inference integration
Outperforms H₂O in passkey retrieval by over 60%
Manlai Liang
AI Lab, China Merchants Bank, China
JiaMing Zhang
AI Lab, China Merchants Bank, China
Xiong Li
University of Electronic Science and Technology of China
Information security · Cryptography
Jinlong Li
AI Lab, China Merchants Bank, China