QERA: an Analytical Framework for Quantization Error Reconstruction

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing low-bit quantization methods for large language models (LLMs) suffer from inaccurate and non-analytic weight quantization error compensation. Method: This paper proposes the first analytically tractable quantization framework targeting activation output error reconstruction. Departing from conventional SVD-based approaches that solely optimize weight errors, our framework leverages matrix calculus and singular value perturbation theory to derive a closed-form optimal solution for output error compensation. It further introduces an output-sensitivity-driven low-rank reconstruction mechanism, enabling plug-and-play integration with mainstream post-training quantization (PTQ) methods such as LoftQ and ZeroQuant-V2. Contribution/Results: On 2-bit RoBERTa-base, our method improves GLUE accuracy by 6.05% over LoftQ; on 4-bit Llama-3.1-70B, it achieves a 2.97% average PTQ accuracy gain over ZeroQuant-V2; and on WikiText2, it reduces perplexity by 0.28 compared to LQER.
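The conventional baseline the summary contrasts against — computing the low-rank term as a truncated SVD of the weight quantization error, which minimizes the weight (not output) approximation error — can be sketched as follows. This is an illustrative toy, not code from the paper: the uniform quantizer and all function names here are hypothetical.

```python
import numpy as np

def quantize(w, bits=2):
    # Toy uniform symmetric quantizer, for illustration only.
    max_abs = np.abs(w).max()
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    levels = np.round(w / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return levels * scale

def svd_error_reconstruction(w, w_q, rank):
    """Rank-k truncated SVD of the weight quantization error W - W_q.

    By Eckart-Young, this minimizes the Frobenius (and spectral) norm of
    the *weight* approximation error, as in the SVD-based baselines
    described above; it ignores how activations propagate that error.
    """
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (m, k) factor
    b = vt[:rank, :]             # (k, n) factor
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_q = quantize(w, bits=2)
a, b = svd_error_reconstruction(w, w_q, rank=8)
# W_q + A @ B is strictly closer to W in Frobenius norm than W_q alone.
assert np.linalg.norm(w - (w_q + a @ b)) < np.linalg.norm(w - w_q)
```

At inference, the corrected layer computes `x @ (w_q + a @ b)` as `x @ w_q + (x @ a) @ b`, keeping the low-rank factors in high precision.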

📝 Abstract
The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there has been increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ and low-precision inference techniques including ZeroQuant-V2. Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA and LQER introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting in improved quantization results. However, these heuristic methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show QERA benefits both existing low-precision fine-tuning and inference methods -- QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}}$ = 6.05% for 2-bit RoBERTa-base on GLUE compared to LoftQ; and obtains $\Delta_{\text{acc}}$ = 2.97% higher post-training quantization accuracy for 4-bit Llama-3.1-70B on average than ZeroQuant-V2, and $\Delta_{\text{ppl}}$ = -0.28 lower perplexity on WikiText2 than LQER.
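One standard closed form for the output-error objective described above — minimizing $\|X\,\Delta W - X\,L\|_F$ over rank-$k$ terms $L$ — whitens the error by a square root of the activation autocorrelation before truncating the SVD, then maps back. The sketch below shows that general recipe under an illustrative diagonal regularization; it is not necessarily the paper's exact algorithm, and the function name is hypothetical.

```python
import numpy as np

def output_aware_low_rank(delta_w, x_calib, rank, eps=1e-6):
    """Rank-k L minimizing ||X @ delta_w - X @ L||_F, in closed form.

    Whiten delta_w by a symmetric square root of the activation
    autocorrelation R = X^T X, truncate the SVD in the whitened space,
    and un-whiten the result. A sketch of the general activation-aware
    recipe, not the paper's exact derivation.
    """
    r = x_calib.T @ x_calib + eps * np.eye(x_calib.shape[1])
    # Symmetric square root (and its inverse) via eigendecomposition.
    evals, evecs = np.linalg.eigh(r)
    r_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    r_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    u, s, vt = np.linalg.svd(r_half @ delta_w, full_matrices=False)
    return r_half_inv @ (u[:, :rank] * s[:rank]) @ vt[:rank, :]

rng = np.random.default_rng(1)
# Anisotropic calibration activations: some channels dominate the output.
x = rng.standard_normal((512, 32)) * np.linspace(0.1, 5.0, 32)
dw = rng.standard_normal((32, 16))  # stand-in for a weight quantization error
L = output_aware_low_rank(dw, x, rank=4)
```

When activations are anisotropic, this $L$ yields a lower output error $\|X(\Delta W - L)\|_F$ than plain rank-$k$ SVD of $\Delta W$, which is exactly the gap between weight-error and output-error objectives that the paper analyzes.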
Problem

Research questions and friction points this paper is trying to address.

Quantization error reconstruction in LLMs
Analytical framework for low-precision deployment
Closed-form solution for error minimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization Error Reconstruction Analysis
Closed-form solution framework
Enhances low-precision fine-tuning
Cheng Zhang
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

Jeffrey T. H. Wong
Imperial College London
Efficient Machine Learning · Deep Learning

Can Xiao
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

G. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

Yiren Zhao
University of Toronto
Computer Networks · Optical Networks · Datacenter Networks