DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

📅 2025-11-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and overfitting risks of end-to-end optimization of rotation matrices in large language model (LLM) quantization, this paper proposes DartQuant, a distribution-aware, efficient rotation calibration method. Methodologically, DartQuant introduces (i) a distribution constraint mechanism that directly regularizes the statistical properties of rotated activations, bypassing reliance on downstream task losses, and (ii) QR-Orth, a novel orthogonal optimization algorithm that replaces conventional alternating optimization to substantially improve convergence speed and generalization. Notably, DartQuant completes full-parameter rotation calibration of the LLaMA-2 70B model on a single RTX 3090 GPU, the first such result. Experiments show that, under 4-bit quantization, DartQuant achieves a 47× speedup and 10× memory reduction for the rotational optimization itself, significantly outperforming existing rotation-based quantization methods. This advancement enables efficient deployment of large models in resource-constrained environments.

📝 Abstract
Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47× acceleration and 10× memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.
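The outlier-smoothing effect the abstract describes can be sketched numerically: multiplying activations by an orthogonal matrix preserves their norms but mixes outlier energy across channels, shrinking the dynamic range a symmetric quantizer must cover. The toy setup below is an illustration of that general principle, not the paper's method; the channel count, outlier scale, and dynamic-range metric are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with a few outlier channels, a pattern common in LLMs.
x = rng.normal(size=(512, 64))
x[:, :4] *= 50.0  # outlier channels inflate the per-tensor quantization range

# A random orthogonal rotation, drawn via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
x_rot = x @ q  # row norms are preserved; outlier energy spreads across channels

def dyn_range(a):
    """Peak magnitude over RMS: the range a symmetric quantizer must cover."""
    return np.abs(a).max() / np.sqrt(np.mean(a ** 2))

print(dyn_range(x) > dyn_range(x_rot))  # expected: True (rotation flattens the range)
```

Because the rotation is orthogonal, it can be folded into adjacent weight matrices at no inference cost, which is why rotation-based methods pay only a calibration-time price, the cost DartQuant targets.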
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of rotational optimization in LLM quantization
Mitigates overfitting risk by minimizing task-specific loss dependency
Enables large model quantization on resource-constrained hardware environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient distribution-aware rotational calibration method
QR-Orth optimization replaces expensive alternating optimization
Constrains activation distribution to reduce rotational complexity
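One way to read the QR-Orth idea (the paper's exact formulation is not given on this page) is as a QR-based retraction: gradient descent runs on an unconstrained matrix, and a QR factorization maps it back to an orthogonal rotation at every step, so no alternating optimization rounds are needed to stay on the orthogonal manifold. The sketch below uses an illustrative distribution loss, squared excess of per-channel kurtosis over the Gaussian value, as a stand-in for the paper's activation-distribution constraint; the data, loss, and optimizer settings are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy heavy-tailed activations: Laplace channels (excess kurtosis 3) stand in
# for the outlier-prone distributions that rotation is meant to smooth.
x = rng.laplace(size=(256, 8))

def rotation(w):
    """QR retraction: map any unconstrained matrix to an orthogonal one."""
    q, r = np.linalg.qr(w)
    return q * np.sign(np.diag(r))  # sign fix makes the map unique

def loss(w):
    """Illustrative distribution loss (NOT the paper's exact objective):
    push per-channel kurtosis of rotated activations toward the Gaussian 3."""
    z = x @ rotation(w)
    k = np.mean(z ** 4, axis=0) / np.mean(z ** 2, axis=0) ** 2
    return np.mean((k - 3.0) ** 2)

# Descent on the unconstrained parameter w; the QR retraction keeps the
# actual rotation orthogonal at every step, with no alternating projections.
w = np.eye(8) + 0.1 * rng.normal(size=(8, 8))  # break the symmetric start
start, eps = loss(w), 1e-6
for _ in range(20):
    g = np.zeros_like(w)
    for i in range(8):
        for j in range(8):  # finite-difference gradient; cheap at this size
            d = np.zeros_like(w)
            d[i, j] = eps
            g[i, j] = (loss(w + d) - loss(w - d)) / (2 * eps)
    cur, step = loss(w), 0.1
    while step > 1e-6 and loss(w - step * g) >= cur:
        step *= 0.5  # backtracking keeps the loss non-increasing
    if loss(w - step * g) < cur:
        w = w - step * g

q = rotation(w)
print(loss(w) < start, np.allclose(q.T @ q, np.eye(8)))  # expected: True True
```

A real implementation would use autograd (e.g. a differentiable QR) rather than finite differences; the point here is only that a single unconstrained optimization, composed with a QR retraction, replaces the alternating scheme the paper identifies as the bottleneck.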
👥 Authors
Yuantian Shao (Nanjing University of Science and Technology; C2DL, Institute of Automation, Chinese Academy of Sciences)
Yuanteng Chen (C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy)
Peisong Wang (CASIA, Deep Neural Network Acceleration and Compression)
Jianlin Yu (Huawei Technologies Co., Ltd.)
Jing Lin (Huawei Technologies Co., Ltd.)
Yiwu Yao (Peking University, Artificial Intelligence)
Zhihui Wei (Nanjing University of Science and Technology)
Jian Cheng (C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)