🤖 AI Summary
To address the performance degradation that post-training quantization (PTQ) causes for large language models (LLMs) in memory-bound, small-batch inference on edge devices, a problem aggravated by irregular weight distributions with heavy-tailed outliers, this paper proposes Q-Palette, a versatile quantization framework. First, building on rotation-based transforms that Gaussianize weights and suppress outliers, it derives the information-theoretically optimal bit allocation for Gaussianized weights under a given bit budget, showing that fine-grained fractional-bit quantizers are essential for near-optimal performance. Second, it provides a palette of fractional-bit quantizers, ranging from near-optimal trellis-coded quantizers to faster vector and scalar quantizers, all implemented with optimized CUDA kernels across various bitwidths. Third, it introduces a mixed-scheme quantization framework that jointly optimizes per-layer quantizer choices and layer fusion decisions under resource constraints. To the authors' knowledge, Q-Palette is the first fine-tuning-free PTQ method to approach the rate-distortion theoretical lower bound for Gaussian sources. Evaluated on Llama and Phi models, it achieves 2–4-bit quantization with 50–75% memory reduction and 30–60% latency reduction, substantially improving the efficiency of personalized LLM deployment on edge devices.
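The Gaussianizing effect of rotation mentioned above can be illustrated in a few lines: multiplying a heavy-tailed matrix by a random orthogonal matrix mixes coordinates and, by the central limit theorem, pushes entries toward a Gaussian. The sketch below uses synthetic Laplace-distributed weights and a generic QR-based rotation; it is not Q-Palette's actual transform (rotation-based PTQ methods typically use structured transforms such as randomized Hadamard matrices for speed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed stand-in for an LLM weight matrix (Laplace has
# excess kurtosis 3; a Gaussian has 0). Illustrative data, not real weights.
W = rng.laplace(size=(256, 256))

# Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((256, 256)))
W_rot = W @ Q  # each rotated entry mixes an entire row of W

def excess_kurtosis(x: np.ndarray) -> float:
    """Sample excess kurtosis: ~0 for Gaussian data, > 0 for heavy tails."""
    z = (x.ravel() - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

print(f"excess kurtosis before rotation: {excess_kurtosis(W):.2f}")
print(f"excess kurtosis after rotation:  {excess_kurtosis(W_rot):.2f}")
```

After rotation, the entries are far closer to Gaussian, so a quantizer no longer wastes levels on rare outliers; this is the regime in which the Gaussian distortion-rate analysis applies.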
📝 Abstract
We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
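The "information-theoretically optimal bit allocation" referenced in the abstract follows, in its simplest form, the classic equal-distortion solution for Gaussian sources: with per-layer distortion D_i = sigma_i^2 * 2^(-2 R_i), minimizing total distortion under a total bit budget yields fractional per-layer rates. The sketch below is this textbook result (ignoring the R_i >= 0 constraint and any per-layer weighting), shown only to motivate why fractional-bit quantizers matter; it is not Q-Palette's actual allocator:

```python
import numpy as np

def optimal_fractional_bits(sigmas, total_bits):
    """Closed-form allocation minimizing sum_i sigma_i^2 * 2^(-2 R_i)
    subject to sum_i R_i = total_bits (textbook equal-distortion solution)."""
    sigmas = np.asarray(sigmas, dtype=float)
    mean_rate = total_bits / len(sigmas)
    # Each layer gets the average budget plus a correction proportional to
    # how much its log-scale exceeds the geometric mean of all scales.
    log_scales = np.log2(sigmas)
    return mean_rate + log_scales - log_scales.mean()

sigmas = [0.5, 1.0, 2.0, 4.0]   # hypothetical per-layer weight scales
bits = optimal_fractional_bits(sigmas, total_bits=12.0)
print(bits)  # fractional bitwidths summing to 12; larger scales get more bits
```

The optimum equalizes the per-layer distortions sigma_i^2 * 2^(-2 R_i), and the resulting rates are generally fractional, which is why quantizers restricted to integer bitwidths cannot reach this bound.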