NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from substantial memory overhead due to KV cache storage, while existing vector quantization (VQ) methods for KV compression rely on task-specific calibration datasets and are vulnerable to distribution shift. Method: We propose a calibration-free, low-bit KV cache compression framework. Its core innovation is a novel "normalize-shift-normalize" double-normalization mechanism combined with a Hadamard transformation, which robustly aligns KV caches with the standard Gaussian distribution. This alignment enables cross-task and cross-sequence-length generalization using a single shared codebook. The method integrates token-wise normalization, channel-wise centering, and low-bit VQ. Results: Our approach achieves state-of-the-art performance at 1-bit and 2-bit quantization, delivering higher compression ratios, lower accuracy degradation, and up to 3× throughput improvement over full-precision baselines, while eliminating the need for calibration entirely.

📝 Abstract
Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation, 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3× throughput gain over full-precision baselines.
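The three-step transformation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the ordering of the Hadamard rotation relative to the normalization steps, the function names, and the use of unit-norm (rather than sqrt(d)-scaled) rows are assumptions for demonstration purposes.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix;
    # n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def nsn_transform(X):
    """Sketch of a normalize-shift-normalize pipeline with a Hadamard
    rotation, as described in the abstract. Names and step order are
    illustrative assumptions, not taken from the paper's code."""
    d = X.shape[-1]
    X = X @ hadamard(d).T                                # Hadamard rotation mixes channels
    X = X / np.linalg.norm(X, axis=-1, keepdims=True)    # 1) token-wise normalize
    mu = X.mean(axis=0, keepdims=True)
    X = X - mu                                           # 2) channel-wise centering (shift)
    X = X / np.linalg.norm(X, axis=-1, keepdims=True)    # 3) second token-wise normalize
    return X, mu  # mu would be kept to invert the shift at dequantization

# Toy example: tokens with a skewed, non-centered distribution.
X = np.random.randn(16, 8) * 5.0 + 2.0
Z, mu = nsn_transform(X)
# Every row of Z now has unit norm, so a single shared codebook
# (e.g. points on the unit sphere) can quantize any input distribution.
```

After the transform, each token vector lies on the unit sphere regardless of the original scale or offset, which is what lets a single reusable codebook replace per-task calibration.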
Problem

Research questions and friction points this paper is trying to address.

Memory-intensive KV cache in LLM inference
Existing VQ methods' dependence on calibration datasets
Distribution shift in low-bit KV cache quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Double normalization for KV cache quantization
Calibration-free vector quantization technique
Hadamard transform aligns token distribution
Donghyun Son
Seoul National University
Euntae Choi
Seoul National University
Sungjoo Yoo
Seoul National University
memory / storage