🤖 AI Summary
This work addresses the substantial memory overhead of key-value (KV) caches in large language models, which grows with context length and hinders deployment on resource-constrained devices. The authors introduce the first training-free vector quantization technique for KV cache compression, mapping high-dimensional floating-point vectors to compact integer indices. This approach achieves significant memory reduction without degrading generation quality, overcoming the trade-off between compression ratio and fidelity inherent in low-rank approximation and scalar quantization methods. Evaluated on LLaMA3.1-8B, the method attains an 82.8% KV cache compression rate with only a 1.4% drop in LongBench performance and enables a 4.3× longer generation length under the same memory budget.
📝 Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-constrained environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving model fidelity, representing thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3× longer generation length on the same memory footprint.
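The abstract's core idea, replacing groups of floating-point KV values with small integer indices into a shared codebook, can be sketched as follows. This is a minimal, generic vector-quantization illustration, not the paper's actual algorithm: the codebook size, sub-vector dimension, and random data are all assumptions for the example, and VQKV's codebook construction and encoding details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 centroids over 8-dim sub-vectors, so each
# group of 8 float32 values (32 bytes) is stored as one uint8 index.
num_codes, sub_dim = 256, 8
codebook = rng.standard_normal((num_codes, sub_dim)).astype(np.float32)

def vq_encode(kv: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each sub-vector (row of kv) to the index of its nearest
    codebook centroid under squared Euclidean distance."""
    dists = ((kv[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct approximate sub-vectors by codebook lookup."""
    return codebook[indices]

# A toy "KV cache" of 1024 sub-vectors (8192 floats, 32 KiB) compresses
# to 1024 one-byte indices (1 KiB), with the codebook cost amortized.
kv = rng.standard_normal((1024, sub_dim)).astype(np.float32)
codes = vq_encode(kv, codebook)
approx = vq_decode(codes, codebook)
```

Decoding is a single array lookup, which is what makes the stored cache so compact: only the integer indices are kept per token, while the floating-point codebook is shared across the whole cache.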