VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial memory overhead of key-value (KV) caches in large language models, which escalates with context length and hinders deployment on resource-constrained devices. The authors introduce, for the first time, a training-free vector quantization technique for KV cache compression, mapping high-dimensional floating-point vectors to compact integer indices. This approach achieves significant memory reduction without degrading generation quality, overcoming the traditional trade-off between compression ratio and fidelity inherent in low-rank approximation and scalar quantization methods. Evaluated on LLaMA3.1-8B, the method attains an 82.8% KV cache compression rate with only a 1.4% drop in LongBench performance and enables a 4.3× longer generation length under the same memory budget.

📝 Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3× longer generation length on the same memory footprint.
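The core idea of vector quantization described above — replacing groups of floating-point values with integer indices into a shared codebook — can be illustrated with a minimal sketch. This is a generic VQ example, not the paper's actual algorithm: the sub-vector size `d_sub`, codebook size `n_codes`, and the randomly initialized codebook are all illustrative assumptions (a real system would fit the codebook to KV-cache statistics).

```python
import numpy as np

# Generic vector-quantization sketch (NOT VQKV's actual algorithm):
# split each key/value vector into sub-vectors, map each sub-vector to the
# index of its nearest codeword, and store only the integer indices.

rng = np.random.default_rng(0)

d_sub = 8        # sub-vector dimension (illustrative assumption)
n_codes = 256    # codebook size -> one uint8 index per sub-vector (assumption)

# Stand-in codebook; in practice it would be fit to real KV statistics.
codebook = rng.standard_normal((n_codes, d_sub)).astype(np.float32)

def vq_encode(x):
    """Map (n, d_sub) float32 sub-vectors to (n,) uint8 codeword indices."""
    # Squared Euclidean distance to every codeword, then take the argmin.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).astype(np.uint8)

def vq_decode(idx):
    """Reconstruct approximate sub-vectors by codebook lookup."""
    return codebook[idx]

# A toy "KV cache" slice: 128 sub-vectors of dimension 8.
kv = rng.standard_normal((128, d_sub)).astype(np.float32)
idx = vq_encode(kv)        # 128 one-byte indices replace 128*8 float32 values
kv_hat = vq_decode(idx)    # lossy reconstruction used at attention time

# Fraction of memory saved for the cached vectors (codebook storage excluded).
ratio = 1 - idx.nbytes / kv.nbytes
print(f"compression: {ratio:.1%}")  # → compression: 96.9%
```

The compression ratio here depends only on the sub-vector size and index width, which is why VQ can reach far higher ratios than per-value scalar quantization: one index amortizes over an entire sub-vector, at the cost of reconstruction error governed by codebook quality.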
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
large language models
compression ratio
reconstruction fidelity
resource-limited deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

vector quantization
KV cache compression
training-free compression
large language models
high-fidelity compression
Yixuan Wang
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory
Qingyu Shi
Peking University
computer vision, diffusion, multimodal
Jiayu Zhou
University of Michigan
Machine Learning, AI + Health Informatics
Dianbo Liu
Assistant professor, National University of Singapore
Push the limits of human, machine learning, biomedical sciences
Ziwei He
Shanghai Jiao Tong University
Machine Learning
Zhouhan Lin
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory