DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

📅 2025-11-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and overfitting risks of end-to-end optimization of rotation matrices in large language model (LLM) quantization, this paper proposes DartQuant, a distribution-aware, efficient rotation calibration method. Methodologically, DartQuant introduces (i) a distribution constraint mechanism that directly regularizes the statistical properties of rotated activations, bypassing reliance on downstream task losses, and (ii) QR-Orth, a novel orthogonal optimization algorithm that replaces conventional alternating optimization to substantially improve convergence speed and generalization. Notably, DartQuant completes full-parameter rotation calibration of the LLaMA-2 70B model on a single RTX 3090 GPU, the first such result. Experiments show that, under 4-bit quantization, DartQuant achieves a 47× speedup and 10× memory reduction for the rotational optimization itself, significantly outperforming existing rotation-based quantization methods. This advancement enables efficient deployment of large models in resource-constrained environments.

📝 Abstract
Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47× acceleration and 10× memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.
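The outlier-smoothing effect the abstract describes can be sketched numerically: multiplying activations by an orthogonal matrix preserves their norms but mixes outlier energy across channels, shrinking the dynamic range a symmetric quantizer must cover. The toy setup below is an illustration of that general principle, not the paper's method; the channel count, outlier scale, and dynamic-range metric are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with a few outlier channels, a pattern common in LLMs.
x = rng.normal(size=(512, 64))
x[:, :4] *= 50.0  # outlier channels inflate the per-tensor quantization range

# A random orthogonal rotation, drawn via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
x_rot = x @ q  # row norms are preserved; outlier energy spreads across channels

def dyn_range(a):
    """Peak magnitude over RMS: the range a symmetric quantizer must cover."""
    return np.abs(a).max() / np.sqrt(np.mean(a ** 2))

print(dyn_range(x) > dyn_range(x_rot))  # expected: True (rotation flattens the range)
```

Because the rotation is orthogonal, it can be folded into adjacent weight matrices at no inference cost, which is why rotation-based methods pay only a calibration-time price, the cost DartQuant targets.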
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of rotational optimization in LLM quantization
Mitigates overfitting risk by minimizing task-specific loss dependency
Enables large model quantization on resource-constrained hardware environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient distribution-aware rotational calibration method
QR-Orth optimization replaces expensive alternating optimization
Constrains activation distribution to reduce rotational complexity
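One way to read the QR-Orth idea (the paper's exact formulation is not given on this page) is as a QR-based retraction: gradient descent runs on an unconstrained matrix, and a QR factorization maps it back to an orthogonal rotation at every step, so no alternating optimization rounds are needed to stay on the orthogonal manifold. The sketch below uses an illustrative distribution loss, squared excess of per-channel kurtosis over the Gaussian value, as a stand-in for the paper's activation-distribution constraint; the data, loss, and optimizer settings are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy heavy-tailed activations: Laplace channels (excess kurtosis 3) stand in
# for the outlier-prone distributions that rotation is meant to smooth.
x = rng.laplace(size=(256, 8))

def rotation(w):
    """QR retraction: map any unconstrained matrix to an orthogonal one."""
    q, r = np.linalg.qr(w)
    return q * np.sign(np.diag(r))  # sign fix makes the map unique

def loss(w):
    """Illustrative distribution loss (NOT the paper's exact objective):
    push per-channel kurtosis of rotated activations toward the Gaussian 3."""
    z = x @ rotation(w)
    k = np.mean(z ** 4, axis=0) / np.mean(z ** 2, axis=0) ** 2
    return np.mean((k - 3.0) ** 2)

# Descent on the unconstrained parameter w; the QR retraction keeps the
# actual rotation orthogonal at every step, with no alternating projections.
w = np.eye(8) + 0.1 * rng.normal(size=(8, 8))  # break the symmetric start
start, eps = loss(w), 1e-6
for _ in range(20):
    g = np.zeros_like(w)
    for i in range(8):
        for j in range(8):  # finite-difference gradient; cheap at this size
            d = np.zeros_like(w)
            d[i, j] = eps
            g[i, j] = (loss(w + d) - loss(w - d)) / (2 * eps)
    cur, step = loss(w), 0.1
    while step > 1e-6 and loss(w - step * g) >= cur:
        step *= 0.5  # backtracking keeps the loss non-increasing
    if loss(w - step * g) < cur:
        w = w - step * g

q = rotation(w)
print(loss(w) < start, np.allclose(q.T @ q, np.eye(8)))  # expected: True True
```

A real implementation would use autograd (e.g. a differentiable QR) rather than finite differences; the point here is only that a single unconstrained optimization, composed with a QR retraction, replaces the alternating scheme the paper identifies as the bottleneck.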
👥 Authors
Yuantian Shao (Nanjing University of Science and Technology; C2DL, Institute of Automation, Chinese Academy of Sciences)
Yuanteng Chen (C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy)
Peisong Wang (CASIA, Deep Neural Network Acceleration and Compression)
Jianlin Yu (Huawei Technologies Co., Ltd.)
Jing Lin (Huawei Technologies Co., Ltd.)
Yiwu Yao (Peking University, Artificial Intelligence)
Zhihui Wei (Nanjing University of Science and Technology)
Jian Cheng (C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)