MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

📅 2025-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational overhead of dynamic calibration and the low accuracy of static quantization in 4-bit quantization of large language models (LLMs) under long-sequence autoregressive inference, this paper proposes MergeQuant, a per-channel static quantization framework. Its core contributions are: (1) Quantization Step Migration (QSM), which fuses per-channel quantization/dequantization steps with the corresponding scalings and linear mappings, eliminating the quantization overheads before and after matrix multiplication; and (2) dimensional reconstruction and adaptive clipping, which address the non-uniformity of per-channel quantization scale factors and redistribute channel variations to subsequent modules to balance the parameter distribution under QSM. Evaluated on the Llama-2 series under static W4A4 quantization, MergeQuant narrows the zero-shot accuracy gap to FP16 to 1.3 points on the 70B model, while achieving up to 1.77× decoding speedup and up to 2.06× end-to-end speedup on the 7B model. The method offers a path to high-accuracy, high-efficiency low-bit LLM deployment without per-token dynamic calibration.
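The core idea behind QSM, as described above, can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal NumPy example, with invented shapes and calibration data, showing how a static per-channel dequantization scale can be folded ("migrated") into the following linear layer's weights, so that no scale computation or explicit dequantization remains at inference time.

```python
import numpy as np

# Illustrative sketch of quantization-step migration (assumed setup, not the
# paper's code): activations are quantized per channel with *static* scales
# calibrated offline, and the dequantization is folded into the next weight
# matrix, since (Xq * s) @ W == Xq @ (s[:, None] * W).

rng = np.random.default_rng(0)
seq, d_in, d_out = 5, 8, 4

X = rng.normal(size=(seq, d_in)).astype(np.float32)    # activations
W = rng.normal(size=(d_in, d_out)).astype(np.float32)  # linear-layer weights

# Static per-channel scales from calibration data (here: X itself).
# Signed 4-bit integers span [-8, 7].
s = np.maximum(np.abs(X).max(axis=0) / 7.0, 1e-8)

# Quantize each channel with its fixed scale -- no per-token calibration.
Xq = np.clip(np.round(X / s), -8, 7)

# Migrate the dequantization scales into the weights once, offline.
W_merged = s[:, None] * W

y_explicit = (Xq * s) @ W      # dequantize, then matmul
y_fused = Xq @ W_merged        # matmul directly on quantized values

assert np.allclose(y_explicit, y_fused, atol=1e-4)
```

Because `W_merged` is computed once ahead of time, the runtime path is a single integer-input matmul, which is what removes the repeated quantize/dequantize steps from long-sequence autoregressive decoding.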

📝 Abstract
Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation over long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to subsequent modules, balancing the parameter distribution under QSM. Within the static W4A4 quantization setting, MergeQuant reduces the accuracy gap on zero-shot tasks to 1.3 points relative to the FP16 baseline on the Llama-2-70B model. On the Llama-2-7B model, MergeQuant achieves up to 1.77x decoding speedup and up to 2.06x end-to-end speedup over the FP16 baseline.
Problem

Research questions and friction points this paper is trying to address.

High overhead of repeated dynamic quantization/dequantization in long-sequence autoregressive inference
Low accuracy of existing 4-bit static quantization
Need for both speed and accuracy in low-bit LLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-channel static quantization framework
Quantization Step Migration method
Dimensional reconstruction and adaptive clipping
Jingyu Wang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; PengCheng Laboratory, Shenzhen, China
Haifeng Sun
Associate Professor of Computer Science, Beijing University of Posts and Telecommunications
Natural Language Processing; Intent-Based Networking; NetAI
Tingting Yang
Professor, Peng Cheng Laboratory
Integrated Maritime Networks; NET4AI; Communications and Computing Integrated Networks
Zirui Zhuang
Associate Professor, Beijing University of Posts and Telecommunications
Computer Networking; Computer Communications; Machine Learning; AI for Network; Network for AI
Wanyi Ning
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Yuexi Yin
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; PengCheng Laboratory, Shenzhen, China
Qi Qi
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Jianxin Liao
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China