QERA: an Analytical Framework for Quantization Error Reconstruction

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing low-bit quantization methods for large language models (LLMs) suffer from inaccurate and non-analytic weight quantization error compensation. Method: This paper proposes the first analytically tractable quantization framework targeting activation output error reconstruction. Departing from conventional SVD-based approaches that solely optimize weight errors, our framework leverages matrix calculus and singular value perturbation theory to derive a closed-form optimal solution for output error compensation. It further introduces an output-sensitivity-driven low-rank reconstruction mechanism, enabling plug-and-play integration with mainstream post-training quantization (PTQ) methods such as LoftQ and ZeroQuant-V2. Contribution/Results: On 2-bit RoBERTa-base, our method improves GLUE accuracy by 6.05% over LoftQ; on 4-bit Llama-3.1-70B, it achieves a 2.97% average PTQ accuracy gain over ZeroQuant-V2; and on WikiText2, it reduces perplexity by 0.28 compared to LQER.
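The conventional baseline the summary contrasts against — computing the low-rank term as a truncated SVD of the weight quantization error, which minimizes the weight (not output) approximation error — can be sketched as follows. This is an illustrative toy, not code from the paper: the uniform quantizer and all function names here are hypothetical.

```python
import numpy as np

def quantize(w, bits=2):
    # Toy uniform symmetric quantizer, for illustration only.
    max_abs = np.abs(w).max()
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    levels = np.round(w / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return levels * scale

def svd_error_reconstruction(w, w_q, rank):
    """Rank-k truncated SVD of the weight quantization error W - W_q.

    By Eckart-Young, this minimizes the Frobenius (and spectral) norm of
    the *weight* approximation error, as in the SVD-based baselines
    described above; it ignores how activations propagate that error.
    """
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (m, k) factor
    b = vt[:rank, :]             # (k, n) factor
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_q = quantize(w, bits=2)
a, b = svd_error_reconstruction(w, w_q, rank=8)
# W_q + A @ B is strictly closer to W in Frobenius norm than W_q alone.
assert np.linalg.norm(w - (w_q + a @ b)) < np.linalg.norm(w - w_q)
```

At inference, the corrected layer computes `x @ (w_q + a @ b)` as `x @ w_q + (x @ a) @ b`, keeping the low-rank factors in high precision.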

📝 Abstract
The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there has been increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ and low-precision inference techniques including ZeroQuant-V2. Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA and LQER introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting in improved quantization results. However, these heuristic methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show QERA benefits both existing low-precision fine-tuning and inference methods -- QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}}$ = 6.05% for 2-bit RoBERTa-base on GLUE compared to LoftQ; and obtains $\Delta_{\text{acc}}$ = 2.97% higher post-training quantization accuracy for 4-bit Llama-3.1-70B on average than ZeroQuant-V2, and $\Delta_{\text{ppl}}$ = -0.28 lower perplexity on WikiText2 than LQER.
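One standard closed form for the output-error objective described above — minimizing $\|X\,\Delta W - X\,L\|_F$ over rank-$k$ terms $L$ — whitens the error by a square root of the activation autocorrelation before truncating the SVD, then maps back. The sketch below shows that general recipe under an illustrative diagonal regularization; it is not necessarily the paper's exact algorithm, and the function name is hypothetical.

```python
import numpy as np

def output_aware_low_rank(delta_w, x_calib, rank, eps=1e-6):
    """Rank-k L minimizing ||X @ delta_w - X @ L||_F, in closed form.

    Whiten delta_w by a symmetric square root of the activation
    autocorrelation R = X^T X, truncate the SVD in the whitened space,
    and un-whiten the result. A sketch of the general activation-aware
    recipe, not the paper's exact derivation.
    """
    r = x_calib.T @ x_calib + eps * np.eye(x_calib.shape[1])
    # Symmetric square root (and its inverse) via eigendecomposition.
    evals, evecs = np.linalg.eigh(r)
    r_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    r_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    u, s, vt = np.linalg.svd(r_half @ delta_w, full_matrices=False)
    return r_half_inv @ (u[:, :rank] * s[:rank]) @ vt[:rank, :]

rng = np.random.default_rng(1)
# Anisotropic calibration activations: some channels dominate the output.
x = rng.standard_normal((512, 32)) * np.linspace(0.1, 5.0, 32)
dw = rng.standard_normal((32, 16))  # stand-in for a weight quantization error
L = output_aware_low_rank(dw, x, rank=4)
```

When activations are anisotropic, this $L$ yields a lower output error $\|X(\Delta W - L)\|_F$ than plain rank-$k$ SVD of $\Delta W$, which is exactly the gap between weight-error and output-error objectives that the paper analyzes.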
Problem

Research questions and friction points this paper is trying to address.

Quantization error reconstruction in LLMs
Analytical framework for low-precision deployment
Closed-form solution for error minimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization Error Reconstruction Analysis
Closed-form solution framework
Enhances low-precision fine-tuning
Cheng Zhang
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

Jeffrey T. H. Wong
Imperial College London
Efficient Machine Learning · Deep Learning

Can Xiao
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

G. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, London, UK

Yiren Zhao
University of Toronto
Computer Networks · Optical Networks · Datacenter Networks