GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration

๐Ÿ“… 2025-04-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the layer-wise accumulation of quantization error in post-training quantization of large language models, this paper proposes GPTQv2, an efficient, finetuning-free compression method for Transformer architectures. Its core innovation is an asymmetric calibration paradigm, derived from Optimal Brain Compression theory, which yields a closed-form solution that jointly minimizes the quantization error and the asymmetry error accumulated from previous layers, keeping each quantized layer's output faithful to its full-precision counterpart. Leveraging channel parallelization, neuron decomposition, and Cholesky-based matrix fusion, GPTQv2 remains computationally efficient and requires only about 20 more lines of code than GPTQ. It enables single-GPU quantization of a 405B-parameter language model and of EVA-02, the top-ranked vision Transformer, which achieves 90% ImageNet accuracy. This work establishes a lightweight, high-fidelity quantization paradigm for ultra-large-scale model deployment.
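The asymmetric calibration objective described above can be written down explicitly. The notation below is a plausible reading of the summary, not lifted from the paper: let $W$ be a layer's full-precision weights, $X$ its inputs in the full-precision model, and $\tilde{X}$ the corresponding inputs in the quantized model (which carry the error accumulated in earlier layers).

```latex
\min_{\hat{W}} \;\bigl\| \hat{W}\tilde{X} - W X \bigr\|_F^2
```

Symmetric calibration (as in GPTQ) corresponds to the special case $\tilde{X} = X$; relaxing $\hat{W}$ to continuous values gives the closed-form minimizer $\hat{W}^{\star} = W X \tilde{X}^{\top} \bigl(\tilde{X}\tilde{X}^{\top}\bigr)^{-1}$, which accounts for both the quantization error and the input mismatch.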

๐Ÿ“ Abstract
We introduce GPTQv2, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, using only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer, which achieves 90% ImageNet accuracy. Code is available at github.com/Intelligent-Computing-Lab-Yale/GPTQv2.
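To make the asymmetric-calibration idea concrete, here is a minimal NumPy sketch of the continuous closed-form target described in the abstract. All names are ours and the round-to-nearest step stands in for GPTQ's error-compensated column-by-column quantization; this is an illustration of the calibration objective, not the paper's implementation.

```python
import numpy as np

def asymmetric_calibration_target(W, X_fp, X_q, damp=0.01):
    """Continuous least-squares target for asymmetric calibration (sketch).

    The quantized layer receives X_q (inputs from the already-quantized
    model, carrying accumulated error) but should reproduce the
    full-precision output W @ X_fp.  Minimizing ||V @ X_q - W @ X_fp||_F^2
    over a continuous V gives the closed form V = W @ H_xq @ H_q^{-1}.
    """
    H_q = X_q @ X_q.T    # second moment of quantized-model inputs
    H_xq = X_fp @ X_q.T  # cross term between clean and quantized inputs
    # Diagonal damping for numerical stability, as in GPTQ-style methods.
    H_q = H_q + damp * np.mean(np.diag(H_q)) * np.eye(H_q.shape[0])
    return W @ H_xq @ np.linalg.inv(H_q)

def quantize_rtn(V, n_bits=4):
    """Per-row round-to-nearest uniform quantization (a stand-in for
    GPTQ's error-compensated greedy quantization)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(V).max(axis=1, keepdims=True) / qmax
    return np.round(V / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X_fp = rng.standard_normal((16, 64))
X_q = X_fp + 0.1 * rng.standard_normal((16, 64))  # simulated upstream error

V = asymmetric_calibration_target(W, X_fp, X_q, damp=1e-6)
# The calibrated target matches the full-precision output more closely
# than reusing W on the drifted inputs does.
err_asym = np.linalg.norm(V @ X_q - W @ X_fp)
err_naive = np.linalg.norm(W @ X_q - W @ X_fp)
```

In this toy setup the least-squares optimum `V` cannot do worse than `W` itself on the drifted inputs, which is exactly the advantage the abstract claims over layer-independent (symmetric) calibration.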
Problem

Research questions and friction points this paper is trying to address.

Develops finetuning-free quantization for large transformers
Reduces quantization error via asymmetric calibration
Enables efficient low-bit compression with minimal code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric calibration reduces quantization error
Optimal brain compression for closed-form solution
Parallelization techniques enhance implementation efficiency
๐Ÿ”Ž Similar Papers
No similar papers found.
Yuhang Li
Yale University
Machine Learning

Ruokai Yin
Yale University
Computer Architecture · Domain-specific Acceleration · Deep Learning · Neuromorphic Computing

Donghyun Lee
Department of Electrical Engineering, Yale University

Shiting Xiao
Department of Electrical Engineering, Yale University

Priyadarshini Panda
Department of Electrical Engineering, Yale University