CBQ: Cross-Block Quantization for Large Language Models

📅 2023-12-13
🏛️ arXiv.org
📈 Citations: 10
Influential: 5
🤖 AI Summary
Existing post-training quantization (PTQ) methods for large language models (LLMs) model outliers only within individual layers or blocks, neglecting cross-module dependencies—leading to error accumulation and severe performance degradation under ultra-low-bit quantization. To address this, we propose the first cross-module collaborative reconstruction framework tailored for LLMs. Our approach features: (1) cross-block dependency modeling and homologous reconstruction to mitigate inter-module error propagation; (2) a coarse-to-fine preprocessing (CFP) strategy that enhances weight distribution robustness at extremely low bit-widths; and (3) adaptive LoRA-Rounding—a gradient-aware weight quantization technique operating entirely within the PTQ paradigm. Extensive experiments demonstrate state-of-the-art performance across challenging settings including W4A4, W4A8, and W2A16. On LLaMA-65B, our method achieves 4-bit quantization in just 4.3 hours on a single GPU, delivering both high accuracy and computational efficiency.
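To make the "W4" setting above concrete, here is a minimal sketch of per-output-channel symmetric 4-bit round-to-nearest weight quantization. This is a generic baseline for illustration only; CBQ's actual method adds cross-block reconstruction, CFP, and LoRA-Rounding on top of a scheme like this.

```python
import numpy as np

def quantize_w4_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT4 quantization (round-to-nearest).

    Generic baseline sketch, not the paper's scheme. Signed INT4 covers
    [-8, 7]; scaling by max|w|/7 keeps zero exactly representable.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = max_abs / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover approximate float weights from INT4 codes and per-channel scales.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_w4_per_channel(w)
w_hat = dequantize(q, scale)
# Round-to-nearest bounds the per-element error by half a scale step.
max_err = np.abs(w - w_hat).max()
```

Per-element error is bounded by `scale / 2`, which is exactly the kind of per-layer error that, left uncorrected, accumulates across blocks.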
📝 Abstract
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency between blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ models cross-block dependency with a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addresses cross-block dependency in quantization
Minimizes error accumulation in low-bit settings
Improves quantization accuracy and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-block reconstruction-based PTQ
Coarse-to-fine preprocessing strategy
Adaptive LoRA-Rounding technique
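The cross-block idea above can be illustrated with a toy contrast between a per-block and a cross-block reconstruction objective. Plain NumPy linear layers stand in for transformer blocks, and simple round-to-nearest quantization stands in for the paper's LoRA-Rounding; all names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_quant(w: np.ndarray, scale: float) -> np.ndarray:
    # Round-to-nearest INT4 quantize-dequantize of a weight matrix.
    return np.clip(np.round(w / scale), -8, 7) * scale

# Two toy linear "blocks" stand in for consecutive transformer blocks.
W = [rng.normal(size=(8, 8)) for _ in range(2)]
scales = [np.abs(w).max() / 7.0 for w in W]
x = rng.normal(size=(4, 8))

# Per-block reconstruction: each block's output error is measured in
# isolation, feeding the full-precision activations into the second block.
h_fp = x @ W[0].T
per_block_err = (
    np.mean((x @ W[0].T - x @ fake_quant(W[0], scales[0]).T) ** 2)
    + np.mean((h_fp @ W[1].T - h_fp @ fake_quant(W[1], scales[1]).T) ** 2)
)

# Cross-block reconstruction: run *both* quantized blocks end to end and
# compare the final output, so errors that compound across blocks are
# penalized jointly rather than ignored.
fp_out = h_fp @ W[1].T
q_out = (x @ fake_quant(W[0], scales[0]).T) @ fake_quant(W[1], scales[1]).T
cross_block_err = np.mean((fp_out - q_out) ** 2)
```

The per-block objective never sees the compounded error, because block 2 is always evaluated on clean inputs; the cross-block objective is what CBQ optimizes over a sliding window of blocks to keep that accumulation in check.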
Xin Ding
University of Science and Technology of China, Huawei Noah’s Ark Lab
Xiaoyu Liu
University of Science and Technology of China, Huawei Noah’s Ark Lab
Zhijun Tu
Huawei Noah's Ark Lab
Efficient LLM and AIGC Systems, Model Compression
Yun-feng Zhang
DSA Thrust, INFO Hub, Hong Kong University of Science and Technology (GZ)
Wei Li
Huawei Noah’s Ark Lab
Jie Hu
Huawei Noah’s Ark Lab
Hanting Chen
Noah's Ark Lab, Huawei
deep learning, machine learning, computer vision
Yehui Tang
Shanghai Jiao Tong University
Machine Learning, Quantum AI & AI4Science
Zhiwei Xiong
University of Science and Technology of China
computational photography, biomedical image analysis
Baoqun Yin
University of Science and Technology of China
Yunhe Wang
Noah's Ark Lab, Huawei Technologies
Deep Learning, Language Model, Machine Learning, Computer Vision