🤖 AI Summary
This work addresses inefficient bit allocation in post-training quantization of large language models (LLMs) by bringing the reverse water-filling solution from information theory to LLM quantization. It analyzes WaterSIC, a scheme that uses only scalar INT quantizers and allocates rate under a weighted mean squared error criterion defined by the covariance matrix $Σ_X$ of the columns of the second MatMul factor. In the high-rate regime, WaterSIC's performance is basis free (it is characterized by the determinant of $Σ_X$, so random rotations do not change it) and lies within 0.25 bit per entry of the information-theoretic distortion limit. With a random rotation and the actual $Σ_X$ from Llama-3-8B, GPTQ comes within 0.1 bit of WaterSIC (depending on the layer type), suggesting that GPTQ with random rotation is also near optimal in the high-rate regime.
📝 Abstract
This is the second part of a work investigating quantized matrix multiplication (MatMul). In Part I we considered calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs.
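To make the role of $Σ_X$ concrete, the following minimal sketch checks numerically that the output error of a quantized first factor depends on the second factor only through its second-moment matrix. It assumes $Σ_X$ is the empirical second-moment matrix $XX^\top/n$ of the columns of $X$; the names `W`, `W_hat`, `X`, and the toy rounding quantizer are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 1000

# Hypothetical weight matrix and a toy uniform quantizer with step 0.25.
W = rng.standard_normal((d_out, d_in))
W_hat = np.round(W * 4) / 4

# Second factor with correlated columns (assumed calibration data).
A = rng.standard_normal((d_in, d_in))
X = A @ rng.standard_normal((d_in, n))
Sigma_X = X @ X.T / n  # empirical second-moment matrix of the columns

E = W - W_hat
# Average squared output error, computed directly from the data ...
direct = np.sum((E @ X) ** 2) / n
# ... and as a weighted MSE on the weights alone.
weighted = np.trace(E @ Sigma_X @ E.T)

print(direct, weighted)
```

The two printed values agree up to floating-point error, which is why a quantizer can be designed from $Σ_X$ alone, without access to the individual columns of $X$.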
Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between the coordinates of a vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed, and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and thus, unlike existing schemes, immune to random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and the actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
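As an illustration of the rate allocation mentioned above, here is a minimal sketch of the textbook reverse waterfilling rule for independent Gaussian components under plain MSE. It is not the paper's WMSE formulation and not the WaterSIC algorithm. The function name `reverse_waterfilling`, the bisection search, and the example variances are assumptions made for illustration. The last lines convert the factor $\frac{2πe}{12}$ into a rate gap, assuming the usual high-rate relation of half a bit per factor-of-four in distortion.

```python
import numpy as np

def reverse_waterfilling(variances, total_rate, iters=200):
    """Split `total_rate` bits across independent Gaussian components.

    Each component gets distortion D_i = min(theta, sigma_i^2) and
    rate R_i = 0.5 * log2(sigma_i^2 / D_i); the water level theta is
    found by bisection so that the rates sum to `total_rate`.
    """
    v = np.asarray(variances, dtype=float)
    lo, hi = 0.0, v.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.log2(np.maximum(v / theta, 1.0)).sum()
        if rate > total_rate:
            lo = theta  # rate too high: raise the water level
        else:
            hi = theta  # rate within budget: try a lower level
    theta = hi
    D = np.minimum(theta, v)
    R = 0.5 * np.log2(v / D)
    return theta, D, R

# Hypothetical component variances and a 3-bit total budget.
theta, D, R = reverse_waterfilling([4.0, 1.0, 0.25], total_rate=3.0)
print(theta, D, R)

# Rate equivalent of the multiplicative distortion factor 2*pi*e/12.
gap_bits = 0.5 * np.log2(2 * np.pi * np.e / 12)
print(gap_bits)
```

In this example the high-variance component receives most of the bits and the lowest-variance component receives essentially none, in contrast to an equal split of one bit per component. The computed gap is roughly a quarter of a bit, matching the 0.25 bit/entry figure quoted above.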