Scaling Law for Quantization-Aware Training

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of scalable theoretical guidance for quantization-aware training with 4-bit weights and 4-bit activations (W4A4 QAT). We propose a unified QAT scaling law that models quantization error as a function of model size, training token count, and quantization group size. Based on 268 QAT experiments and a decomposition of the error into weight and activation components, we report three key findings: (1) weight and activation quantization errors follow the same overall trend but differ in sensitivity, with weight quantization error growing faster as training tokens increase; (2) activation outliers in the second fully connected layer (FC2) are the dominant error bottleneck in W4A4 QAT; and (3) quantization error decreases with larger models but increases with more training tokens and coarser quantization granularity. We also show empirically that mixed-precision quantization of the FC2 bottleneck brings weight and activation quantization errors to similar levels. These results offer practical guidance for designing and deploying low-bit QAT systems.

📝 Abstract

Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
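As a rough illustration of what fitting such a scaling law could look like, the sketch below assumes a plain power-law form, error = k · D^b · G^c / N^a, where N is model size, D is training tokens, and G is group size. This functional form, the function names, and the synthetic data are all assumptions made for illustration; the abstract does not state the exact formula, and none of the numbers come from the paper's 268 experiments.

```python
import numpy as np

def fit_power_law(N, D, G, delta):
    """Fit log(delta) = log k - a*log N + b*log D + c*log G by least squares.

    The power-law form is an assumption for illustration only.
    """
    X = np.column_stack([
        np.ones_like(N, dtype=float),  # intercept -> log k
        -np.log(N),                    # exponent a (error falls with model size)
        np.log(D),                     # exponent b (error rises with tokens)
        np.log(G),                     # exponent c (error rises with group size)
    ])
    coef, *_ = np.linalg.lstsq(X, np.log(delta), rcond=None)
    log_k, a, b, c = coef
    return float(np.exp(log_k)), float(a), float(b), float(c)

def predict(k, a, b, c, N, D, G):
    """Evaluate the assumed power law."""
    return k * D**b * G**c / N**a

# Synthetic grid of (N, D, G) settings; values are made up for the demo.
Ns, Ds, Gs = np.meshgrid(
    [7e7, 1.5e8, 3e8, 6e8],       # model sizes (parameters)
    [1e10, 2e10, 5e10, 1e11],     # training tokens
    [32.0, 64.0, 128.0, 256.0],   # quantization group sizes
    indexing="ij",
)
N, D, G = Ns.ravel(), Ds.ravel(), Gs.ravel()

# Generate synthetic "measurements" from known exponents plus small noise.
rng = np.random.default_rng(0)
true_k, true_a, true_b, true_c = 0.1, 0.3, 0.1, 0.2
delta = predict(true_k, true_a, true_b, true_c, N, D, G)
delta = delta * np.exp(rng.normal(0.0, 0.01, size=delta.shape))

k, a, b, c = fit_power_law(N, D, G, delta)
print(f"k={k:.3g}, a={a:.3f}, b={b:.3f}, c={c:.3f}")
```

With positive fitted exponents, the assumed form reproduces the qualitative trend described in the abstract: error falls as N grows and rises with D and G.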

Problem

Research questions and friction points this paper is trying to address.

Understanding scaling behavior of 4-bit quantization-aware training (QAT)

Modeling quantization error as a function of model size, training data volume, and quantization group size

Identifying and addressing bottlenecks in W4A4 QAT error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified QAT scaling law that models quantization error

Decomposition of W4A4 error into weight and activation components

Mixed-precision quantization addresses FC2 layer outliers
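To make the role of outliers, group size, and bit width concrete, here is a minimal sketch using symmetric per-group round-to-nearest fake quantization on synthetic activations. It is a post-hoc toy experiment, not QAT and not the paper's FC2 measurements; the outlier channel, tensor shape, and helper names are hypothetical.

```python
import numpy as np

def fake_quantize(x, bits, group_size):
    """Symmetric per-group round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    groups = x.reshape(-1, group_size)           # contiguous groups along the last axis
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # avoid division by zero
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(x.shape)

def rel_error(x, bits, group_size):
    """Relative mean squared quantization error."""
    xq = fake_quantize(x, bits, group_size)
    return float(np.mean((x - xq) ** 2) / np.mean(x ** 2))

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 1024))   # synthetic activations (tokens x channels)
acts[:, 7] *= 50.0                   # one hypothetical outlier channel

err_g32 = rel_error(acts, bits=4, group_size=32)     # fine-grained 4-bit
err_g256 = rel_error(acts, bits=4, group_size=256)   # coarse-grained 4-bit
err_8bit = rel_error(acts, bits=8, group_size=256)   # higher precision, coarse groups

print(f"4-bit, G=32:  {err_g32:.4f}")
print(f"4-bit, G=256: {err_g256:.4f}")
print(f"8-bit, G=256: {err_8bit:.4f}")
```

The outlier inflates the scale of every group it falls into, so coarser groups spread the damage over more values, while a higher bit width for the outlier-heavy tensor shrinks the error. That is the intuition behind applying mixed precision to a layer with activation outliers.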
