🤖 AI Summary
To address the excessive memory overhead of KV caches in long-context reasoning with large language models, this paper proposes a low-bit KV quantization method that eliminates explicit normalization. The core contribution is the discovery that random matrix preprocessing concentrates the angular distribution of KV embeddings in polar coordinates—enabling direct, analytically tractable quantization of the polar angles alone. This bypasses the need to store full-precision zero-point and scaling parameters required by conventional quantization schemes. The method comprises three stages: random preprocessing, recursive polar coordinate transformation, and angular quantization. Experiments demonstrate over 4.2× KV cache compression, significantly outperforming state-of-the-art quantization approaches on long-text tasks while preserving inference quality.
📝 Abstract
Large language models (LLMs) require significant memory to store Key-Value (KV) embeddings in their KV cache, especially when handling long-range contexts. Quantizing these KV embeddings is a common technique for reducing memory consumption. This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation. Our method transforms the KV embeddings into polar coordinates using an efficient recursive algorithm and then quantizes the resulting angles. Our key insight is that, after random preconditioning, the angles in the polar representation exhibit a tightly bounded and highly concentrated distribution with an analytically computable form. This well-behaved distribution eliminates the need for explicit normalization, a step required by traditional quantization methods that introduces significant memory overhead because quantization parameters (e.g., zero point and scale) must be stored in full precision per data block. PolarQuant bypasses this normalization step, enabling substantial memory savings. Our long-context evaluation demonstrates that PolarQuant compresses the KV cache by over 4.2× while achieving the best quality scores among state-of-the-art methods.
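To make the pipeline concrete, here is a minimal NumPy sketch of the three stages described above. It is an illustration under assumptions, not the paper's implementation: the random preconditioner is taken to be a random rotation (QR of a Gaussian matrix), the recursive polar transform is read as the standard hyperspherical recursion, and the key point — quantizing angles uniformly over a *fixed*, data-independent range, with no per-block scale or zero point — is what the concentrated angle distribution enables. The radius is kept in full precision here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_polar(x):
    """Map x in R^d to (radius, d-1 angles) via the recursive
    hyperspherical transform (one reading of the paper's
    'recursive polar coordinate transformation')."""
    r = np.linalg.norm(x)
    angles, v = [], x
    while v.size > 2:
        # Angle between the leading coordinate and the remaining tail.
        angles.append(np.arctan2(np.linalg.norm(v[1:]), v[0]))
        v = v[1:]
    angles.append(np.arctan2(v[1], v[0]))  # last angle keeps the sign
    return r, np.array(angles)

def from_polar(r, angles):
    """Invert to_polar."""
    x, s = np.empty(angles.size + 1), r
    for i, th in enumerate(angles):
        x[i] = s * np.cos(th)
        s *= np.sin(th)
    x[-1] = s
    return x

def quantize(a, bits, lo, hi):
    # Uniform quantization over a FIXED range: no per-block scale or
    # zero point is stored -- the normalization-free property.
    levels = 2 ** bits
    t = np.clip((a - lo) / (hi - lo), 0.0, 1.0)
    return np.minimum((t * levels).astype(np.uint32), levels - 1)

def dequantize(codes, bits, lo, hi):
    return lo + (codes + 0.5) * (hi - lo) / 2 ** bits

d, bits = 8, 8
x = rng.standard_normal(d)  # stand-in for one KV embedding

# Stage 1: random preconditioning by a random rotation.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = q @ x

# Stage 2: recursive polar transform.
r, angles = to_polar(y)

# Stage 3: angular quantization. The first d-2 angles lie in [0, pi];
# the last, signed angle lies in [-pi, pi].
codes = np.concatenate([quantize(angles[:-1], bits, 0.0, np.pi),
                        quantize(angles[-1:], bits, -np.pi, np.pi)])
a_hat = np.concatenate([dequantize(codes[:-1], bits, 0.0, np.pi),
                        dequantize(codes[-1:], bits, -np.pi, np.pi)])

# Decode: invert the polar transform, then undo the rotation.
x_hat = q.T @ from_polar(r, a_hat)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

Because the quantization range is fixed analytically rather than fitted per block, the only stored state per vector is the integer angle codes (plus the radius), which is where the memory savings over scale/zero-point schemes come from.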