🤖 AI Summary
Low-bit quantization of large language models (LLMs) is hindered by systematic outliers in activations and weights, while existing homogeneous transformation methods—e.g., affine or rotational transformations—ignore inter-layer distribution heterogeneity. To address this, we propose the first adaptive, layer-wise transformation selection framework: it links the optimal transformation type for each layer to weight kurtosis and introduces a lightweight layer selection strategy guided by robust z-scores. Furthermore, we jointly optimize affine and rotational transformations via differentiable learning to enable efficient, layer-customized quantization. Evaluated on LLaMA-family models, our method reduces perplexity by up to 4.58 points and improves average zero-shot accuracy across six tasks by 2.11% over FlatQuant, significantly outperforming fixed-transformation baselines.
📝 Abstract
Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and weights, which cause substantial LLM performance degradation, especially at low-bit settings. While existing transformation-based methods such as affine and rotation transformations successfully mitigate outliers, they adopt a homogeneous transformation setting, i.e., the same transformation type across all layers, ignoring the heterogeneous distribution characteristics within LLMs. In this paper, we propose an adaptive transformation selection framework that systematically determines the optimal transformation on a per-layer basis. To this end, we first formulate transformation selection as a differentiable optimization problem that identifies the appropriate transformation type for each layer. However, searching for optimal layer-wise transformations for every model is computationally expensive. To reduce this overhead, we establish a connection between weight distribution kurtosis and the appropriate transformation type. Specifically, we propose an outlier-guided layer selection method based on robust $z$-score normalization that matches the performance of differentiable search at a fraction of the cost. Comprehensive experiments on LLaMA family models demonstrate that our adaptive approach consistently outperforms the widely used fixed transformation settings. For example, under the aggressive W3A3K2V2 quantization setting on the LLaMA-3-8B model, our method achieves an improvement of up to 4.58 perplexity points and a 2.11% gain in average six-task zero-shot accuracy compared to the current best existing method, FlatQuant, demonstrating the necessity of heterogeneous transformation selection for optimal LLM quantization.
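The abstract does not give the paper's exact formulas, but the two statistics it names are standard. Below is a minimal sketch of what kurtosis-based, robust-z-score-guided layer selection could look like: compute the excess kurtosis of each layer's weights, normalize the per-layer values with a median/MAD robust z-score, and flag layers whose kurtosis is anomalous relative to the rest. All function names and the threshold are hypothetical, not taken from the paper.

```python
import numpy as np

def excess_kurtosis(w):
    """Excess kurtosis of a flattened weight tensor.
    Heavy-tailed (outlier-prone) distributions yield large
    positive values; a Gaussian yields roughly 0."""
    w = np.asarray(w, dtype=float).ravel()
    centered = w - w.mean()
    var = centered.var()
    return (centered ** 4).mean() / var ** 2 - 3.0

def robust_z_scores(values):
    """Robust z-score normalization using the median and the
    median absolute deviation (MAD) instead of mean/std, so a
    few extreme layers cannot distort the scores themselves."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    # 1.4826 makes MAD consistent with the std under normality
    return (values - med) / (1.4826 * mad)

def select_outlier_layers(layer_weights, threshold=2.0):
    """Flag layers whose weight kurtosis is an outlier relative
    to the other layers; these are candidates for a different
    (e.g., rotational vs. affine) transformation type."""
    kurts = [excess_kurtosis(w) for w in layer_weights]
    z = robust_z_scores(kurts)
    return [i for i, s in enumerate(z) if abs(s) > threshold]
```

The robust z-score matters here because kurtosis is itself extremely sensitive to outliers: one heavy-tailed layer would inflate an ordinary mean/std normalization and mask the very layers the method is trying to find.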