🤖 AI Summary
Low-bit quantization of large language models (LLMs) is hindered by systematic outliers in activations and weights, while existing homogeneous transformation methods—e.g., affine or rotational transformations—ignore inter-layer distribution heterogeneity. To address this, we propose the first adaptive, layer-wise transformation selection framework: it links the optimal transformation type for each layer to weight kurtosis and introduces a lightweight layer selection strategy guided by robust z-scores. Furthermore, we jointly optimize affine and rotational transformations via differentiable learning to enable efficient, layer-customized quantization. Evaluated on LLaMA-family models, our method reduces perplexity by up to 4.58 points and improves average zero-shot accuracy across six tasks by 2.11% over FlatQuant, significantly outperforming fixed-transformation baselines.
📝 Abstract
Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and weights, which cause substantial LLM performance degradation, especially at low-bit settings. While existing transformation-based methods such as affine and rotation transformations successfully mitigate outliers, they adopt a homogeneous transformation setting, i.e., the same transformation type across all layers, ignoring the heterogeneous distribution characteristics within LLMs. In this paper, we propose an adaptive transformation selection framework that systematically determines the optimal transformation on a per-layer basis. To this end, we first formulate transformation selection as a differentiable optimization problem that identifies the appropriate transformation type for each layer. However, searching for optimal layer-wise transformations for every model is computationally expensive. To reduce this overhead, we establish a connection between weight distribution kurtosis and the appropriate transformation type. Specifically, we propose an outlier-guided layer selection method based on robust $z$-score normalization that matches the performance of differentiable search at a fraction of the cost. Comprehensive experiments on LLaMA family models demonstrate that our adaptive approach consistently outperforms the widely used fixed transformation settings. For example, under the aggressive W3A3K2V2 quantization setting on the LLaMA-3-8B model, our method achieves an improvement of up to 4.58 perplexity points and a 2.11% gain in average six-task zero-shot accuracy compared to the current best existing method, FlatQuant, demonstrating the necessity of heterogeneous transformation selection for optimal LLM quantization.
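The abstract does not give the paper's exact formulas, but the two statistics it names are standard. Below is a minimal sketch of what kurtosis-based, robust-z-score-guided layer selection could look like: compute the excess kurtosis of each layer's weights, normalize the per-layer values with a median/MAD robust z-score, and flag layers whose kurtosis is anomalous relative to the rest. All function names and the threshold are hypothetical, not taken from the paper.

```python
import numpy as np

def excess_kurtosis(w):
    """Excess kurtosis of a flattened weight tensor.
    Heavy-tailed (outlier-prone) distributions yield large
    positive values; a Gaussian yields roughly 0."""
    w = np.asarray(w, dtype=float).ravel()
    centered = w - w.mean()
    var = centered.var()
    return (centered ** 4).mean() / var ** 2 - 3.0

def robust_z_scores(values):
    """Robust z-score normalization using the median and the
    median absolute deviation (MAD) instead of mean/std, so a
    few extreme layers cannot distort the scores themselves."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    # 1.4826 makes MAD consistent with the std under normality
    return (values - med) / (1.4826 * mad)

def select_outlier_layers(layer_weights, threshold=2.0):
    """Flag layers whose weight kurtosis is an outlier relative
    to the other layers; these are candidates for a different
    (e.g., rotational vs. affine) transformation type."""
    kurts = [excess_kurtosis(w) for w in layer_weights]
    z = robust_z_scores(kurts)
    return [i for i, s in enumerate(z) if abs(s) > threshold]
```

The robust z-score matters here because kurtosis is itself extremely sensitive to outliers: one heavy-tailed layer would inflate an ordinary mean/std normalization and mask the very layers the method is trying to find.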