NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from substantial memory overhead due to KV cache storage, while existing vector quantization (VQ) methods for KV compression rely on task-specific calibration datasets and are vulnerable to distribution shift. Method: We propose a calibration-free, low-bit KV cache compression framework. Its core innovation is a novel "normalize-shift-normalize" double-normalization mechanism combined with a Hadamard transformation, which robustly aligns KV caches with the standard Gaussian distribution. This alignment enables cross-task and cross-sequence-length generalization using a single shared codebook. The method integrates token-wise normalization, channel-wise centering, and low-bit VQ. Results: Our approach achieves state-of-the-art performance at 1-bit and 2-bit quantization, delivering higher compression ratios, lower accuracy degradation, and up to 3× throughput improvement over full-precision baselines, while eliminating the need for calibration entirely.

📝 Abstract
Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation, 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3× throughput gain over full-precision baselines.
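The three-step transformation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the ordering of the Hadamard rotation relative to the normalization steps, the function names, and the use of unit-norm (rather than sqrt(d)-scaled) rows are assumptions for demonstration purposes.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix;
    # n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def nsn_transform(X):
    """Sketch of a normalize-shift-normalize pipeline with a Hadamard
    rotation, as described in the abstract. Names and step order are
    illustrative assumptions, not taken from the paper's code."""
    d = X.shape[-1]
    X = X @ hadamard(d).T                                # Hadamard rotation mixes channels
    X = X / np.linalg.norm(X, axis=-1, keepdims=True)    # 1) token-wise normalize
    mu = X.mean(axis=0, keepdims=True)
    X = X - mu                                           # 2) channel-wise centering (shift)
    X = X / np.linalg.norm(X, axis=-1, keepdims=True)    # 3) second token-wise normalize
    return X, mu  # mu would be kept to invert the shift at dequantization

# Toy example: tokens with a skewed, non-centered distribution.
X = np.random.randn(16, 8) * 5.0 + 2.0
Z, mu = nsn_transform(X)
# Every row of Z now has unit norm, so a single shared codebook
# (e.g. points on the unit sphere) can quantize any input distribution.
```

After the transform, each token vector lies on the unit sphere regardless of the original scale or offset, which is what lets a single reusable codebook replace per-task calibration.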
Problem

Research questions and friction points this paper is trying to address.

Memory-intensive KV cache in LLM inference
Existing VQ methods' dependence on calibration datasets
Distribution shift in low-bit KV cache quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Double normalization for KV cache quantization
Calibration-free vector quantization technique
Hadamard transform aligns token distribution
Donghyun Son
Seoul National University
Euntae Choi
Seoul National University
Sungjoo Yoo
Seoul National University
memory / storage