CommVQ: Commutative Vector Quantization for KV Cache Compression

📅 2025-06-23
🤖 AI Summary
To address the GPU memory bottleneck induced by KV caches in long-context reasoning with large language models (LLMs), this paper proposes an efficient and high-fidelity KV cache compression method. The core innovation is a RoPE-compatible additive vector quantization codebook, paired with a lightweight encoder and an expectation-maximization (EM)-based training algorithm, which lets decoding be fused directly into the self-attention mechanism. The method supports 1–2-bit quantization with negligible accuracy degradation. On long-context benchmarks and on GSM8K, it reduces the FP16 KV cache size by 87.5% at 2-bit quantization, substantially outperforming prior approaches. Furthermore, its 1-bit setting enables the LLaMA-3.1 8B model to run with a full 128K context length on a single RTX 4090 GPU, demonstrating both practical viability and scalability.
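To make the additive-quantization idea concrete, here is a minimal toy sketch (not the paper's implementation; codebook sizes, the greedy residual encoder, and all variable names are illustrative assumptions). It shows the key property the summary mentions: a quantized vector is a sum of codewords, so decoding reduces to a single matrix multiplication of a one-hot code matrix with the stacked codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # head dimension of a toy KV vector
m = 4   # number of additive codebooks (stages)
k = 2   # codewords per codebook -> log2(k) = 1 bit per stage

# One codebook per stage; the reconstruction is the SUM of one
# codeword chosen from each codebook (additive quantization).
codebooks = rng.standard_normal((m, k, d)).astype(np.float32)

def encode(x):
    """Greedy residual encoding: pick the nearest codeword per stage."""
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual -= cb[idx]
    return codes

def decode(codes):
    """Decoding is a sum of codewords, i.e. one matrix multiplication
    of a one-hot code matrix against the stacked codebooks."""
    onehot = np.zeros((m, k), dtype=np.float32)
    onehot[np.arange(m), codes] = 1.0
    return np.einsum("mk,mkd->d", onehot, codebooks)  # sum over stages

x = rng.standard_normal(d).astype(np.float32)
codes = encode(x)
x_hat = decode(codes)
```

The matrix-multiplication form of `decode` is what allows dequantization to be fused with the attention computation instead of materializing the full-precision cache.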

📝 Abstract
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
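The abstract mentions training the codebook with an Expectation-Maximization algorithm. As a rough intuition (a generic k-means-style EM for a single codebook; the paper's actual procedure trains the RoPE-commutative additive codebooks jointly with the encoder, and all sizes here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 256, 8, 4                 # toy KV vectors, dimension, codebook size
X = rng.standard_normal((n, d)).astype(np.float32)
codebook = X[rng.choice(n, k, replace=False)].copy()  # init from data points

for _ in range(20):
    # E-step: assign each vector to its nearest codeword
    dist2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = dist2.argmin(axis=1)
    # M-step: move each codeword to the mean of its assigned vectors
    for j in range(k):
        members = X[assign == j]
        if len(members):
            codebook[j] = members.mean(axis=0)

# quantization error with the learned codebook vs. a single global mean
err = ((X - codebook[assign]) ** 2).sum()
base = ((X - X.mean(axis=0)) ** 2).sum()
```

Each EM iteration cannot increase the quantization error, so the learned codebook is never worse than collapsing all vectors to their global mean.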
Problem

Research questions and friction points this paper is trying to address.

Reduce KV cache memory usage in LLMs
Enable efficient long-context LLM inference
Achieve high accuracy with low-bit quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Commutative Vector Quantization for KV cache compression
Lightweight encoder and codebook for additive quantization
RoPE-commutative codebook for low overhead decoding
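The RoPE-commutative bullet can be illustrated with a toy identity. Treating each 2-D feature pair as a complex number, RoPE multiplies each pair by a position-dependent phase, and a codeword that also acts multiplicatively commutes with it because complex multiplication is commutative. This is an assumption-laden sketch of why rotation can be folded into the codebook, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
pairs = 4                                  # number of 2-D pairs (head dim 8)
theta = rng.uniform(0, 2 * np.pi, pairs)   # per-pair RoPE angles at a position

def rope(z, theta):
    """RoPE on complex-paired features: rotate each pair by its angle."""
    return z * np.exp(1j * theta)

# A codeword acting as a per-pair rotation-scaling (a complex number)
# commutes with RoPE's per-pair rotation.
codeword = rng.standard_normal(pairs) + 1j * rng.standard_normal(pairs)
z = rng.standard_normal(pairs) + 1j * rng.standard_normal(pairs)

lhs = rope(codeword * z, theta)   # decode first, then apply RoPE
rhs = codeword * rope(z, theta)   # fold RoPE in first, then decode
```

Because the two orders agree, dequantization of keys can be deferred and merged into the attention computation without re-applying RoPE to every decoded vector.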