🤖 AI Summary
This work addresses the substantial memory overhead of key-value (KV) caches in large language models, which grows with context length and hinders deployment on resource-constrained devices. The authors introduce the first training-free vector quantization technique for KV cache compression, mapping high-dimensional floating-point vectors to compact integer indices. This approach achieves significant memory reduction without degrading generation quality, overcoming the trade-off between compression ratio and fidelity inherent in low-rank approximation and scalar quantization methods. Evaluated on LLaMA3.1-8B, the method attains an 82.8% KV cache compression rate with only a 1.4% drop in LongBench performance and enables a 4.3× longer generation length under the same memory budget.
📝 Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-constrained environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving model fidelity, representing thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3× longer generation length on the same memory footprint.
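The abstract's core idea, replacing groups of floating-point KV values with small integer indices into a shared codebook, can be sketched as follows. This is a minimal, generic vector-quantization illustration, not the paper's actual algorithm: the codebook size, sub-vector dimension, and random data are all assumptions for the example, and VQKV's codebook construction and encoding details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 centroids over 8-dim sub-vectors, so each
# group of 8 float32 values (32 bytes) is stored as one uint8 index.
num_codes, sub_dim = 256, 8
codebook = rng.standard_normal((num_codes, sub_dim)).astype(np.float32)

def vq_encode(kv: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each sub-vector (row of kv) to the index of its nearest
    codebook centroid under squared Euclidean distance."""
    dists = ((kv[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct approximate sub-vectors by codebook lookup."""
    return codebook[indices]

# A toy "KV cache" of 1024 sub-vectors (8192 floats, 32 KiB) compresses
# to 1024 one-byte indices (1 KiB), with the codebook cost amortized.
kv = rng.standard_normal((1024, sub_dim)).astype(np.float32)
codes = vq_encode(kv, codebook)
approx = vq_decode(codes, codebook)
```

Decoding is a single array lookup, which is what makes the stored cache so compact: only the integer indices are kept per token, while the floating-point codebook is shared across the whole cache.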