KVCrush: Key value cache size-reduction using similarity in head-behaviour

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory footprint of KV caches in large language model (LLM) inference—which limits batch size and throughput—this paper proposes a low-overhead cache compression method. The approach introduces two key innovations: (1) a novel KV state re-representation mechanism based on attention-head behavioral similarity, enabling cross-head redundancy elimination; and (2) a synergistic integration of distribution-aware token pruning and KV sparse re-encoding, fully compatible with vLLM's paged memory management and mixed-precision quantization. Evaluated on LongBench, the method achieves 4× KV cache compression with <1% accuracy degradation and <0.5% increase in inference latency. Its average accuracy matches or exceeds state-of-the-art methods, significantly outperforming existing cache-discard, quantization, and approximation techniques.

📝 Abstract
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to the large context lengths of modern LLMs, the memory footprint of the KV cache is a major deployment bottleneck: it directly limits the model's batch size and hence its ability to deliver high throughput. Existing research addresses this challenge with techniques such as discarding low-attention tokens, quantization, and matrix approximation, which typically degrade model accuracy. In this paper, we propose KVCrush, a technology that can be combined with many KV compression techniques to improve model accuracy at a much smaller memory footprint. KVCrush provides an alternate representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a smaller footprint while maintaining model accuracy. Based on our results, KVCrush reduces the LongBench KV cache size by 4x with less than a 1% accuracy drop and achieves state-of-the-art average accuracy with minimal overhead, incurring less than 0.5% additional total inference latency. KVCrush not only outperforms state-of-the-art importance-based token-retention schemes in accuracy but is also compatible with typical practical LLM deployments that use KV cache paging schemes such as vLLM and mixed-precision quantization.
Problem

Research questions and friction points this paper is trying to address.

Reduce the memory footprint of KV caching in LLMs.
Maintain model accuracy with a smaller KV cache.
Remain compatible with existing KV compression and paging schemes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alternate key-value state representation scheme
Low-overhead token pruning algorithm
Compatible with KV cache paging schemes
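The abstract describes two components: a compact re-representation of KV states based on attention-head behaviour, and a distribution-aware token pruning step that retains coverage of the token distribution rather than only the highest-attention tokens. The paper's own algorithm is not reproduced on this page; the NumPy sketch below is only a rough illustration of that general idea, with invented names (`kv_prune`) and a made-up heuristic: each token gets a binary per-head signature (1 where a head attends to it above that head's average), the most important tokens are kept outright, and the remaining budget goes to representatives chosen greedily to cover distinct signatures. This is a plausible instance of the category, not the authors' method.

```python
import numpy as np

def kv_prune(attn: np.ndarray, keep: int) -> np.ndarray:
    """Select `keep` cached-token indices to retain.

    attn: (num_heads, num_tokens) accumulated attention weight that each
          head has assigned to each cached token (hypothetical input shape).
    Returns a sorted array of `keep` retained token indices.
    """
    num_heads, num_tokens = attn.shape

    # Importance-based retention (as in typical token-discard schemes):
    # keep roughly half the budget for the globally most-attended tokens.
    importance = attn.sum(axis=0)
    n_top = max(1, keep // 2)
    top = np.argsort(importance)[::-1][:n_top]

    # Binary "head-behaviour" signature per token: 1 where a head attends
    # to the token more than that head's average (illustrative heuristic).
    sig = (attn > attn.mean(axis=1, keepdims=True)).astype(np.uint8)

    # Spend the rest of the budget on representatives of the pruned tokens,
    # greedily maximizing Hamming distance to already-chosen signatures so
    # distinct behaviour patterns in the distribution stay represented.
    chosen = list(top)
    rest = np.setdiff1d(np.arange(num_tokens), top)
    while len(chosen) < keep and len(rest) > 0:
        dist = np.array([
            min(int(np.count_nonzero(sig[:, r] ^ sig[:, c])) for c in chosen)
            for r in rest
        ])
        pick = rest[int(np.argmax(dist))]
        chosen.append(pick)
        rest = rest[rest != pick]
    return np.sort(np.array(chosen))
```

A real implementation would operate per layer on paged KV blocks and merge (rather than merely sample) the pruned states, but the sketch shows why such a scheme can cost little: the signature and greedy selection touch only attention statistics, never the full key/value tensors.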