🤖 AI Summary
This work investigates the impact of large language model (LLM) scaling on robustness to mixed-precision quantization. To address the challenge of adaptively configuring high-precision computation ratios and quantization granularity as models grow, we introduce the *quantization ratio*—the proportion of parameters assigned to low-precision arithmetic—as a core metric. We conduct systematic post-training quantization experiments across model families and granularities (layer-level vs. operator-level, particularly matmul), jointly evaluating perplexity and downstream task accuracy. Our key findings are: (i) for every 10× increase in parameter count, the achievable quantization ratio improves by 20–40% at fixed perplexity; (ii) larger models exhibit markedly improved compression–accuracy trade-offs under fine-grained matmul-level quantization, incurring no accuracy loss relative to layer-level quantization. These results establish a positive scaling law linking LLM size to mixed-precision quantization robustness—a foundational insight for efficient LLM deployment.
📝 Abstract
Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models. In this study, we focus on a straightforward question: when aiming for a specific accuracy or perplexity target under low-precision quantization, how many high-precision numbers or calculations need to be preserved as we scale LLMs to larger sizes? We first introduce a critical metric named the quantization ratio, which compares the number of parameters quantized to low-precision arithmetic against the total parameter count. Through extensive and carefully controlled experiments across different model families, arithmetic types, and quantization granularities (e.g., layer-wise, matmul-wise), we identify two central phenomena. 1) The larger the model, the better it can preserve performance at an increased quantization ratio, as measured by perplexity on pre-training tasks or accuracy on downstream tasks. 2) The finer the granularity of mixed-precision quantization (e.g., matmul-wise), the further the model can increase its quantization ratio. We believe these observations offer valuable insights for future AI hardware design and the development of advanced Efficient AI algorithms.
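To make the metric concrete, here is a minimal sketch of how the quantization ratio described above could be computed for a model given per-operator precision assignments. The names (`LayerSpec`, `quantization_ratio`) and the toy parameter counts are illustrative assumptions, not definitions from the paper.

```python
# Hypothetical sketch: the quantization ratio compares parameters
# assigned to low-precision arithmetic against the total parameter count.
from dataclasses import dataclass

@dataclass
class LayerSpec:
    name: str
    num_params: int
    low_precision: bool  # True if this layer/matmul runs in low precision

def quantization_ratio(layers):
    """Fraction of total parameters held in low-precision arithmetic."""
    total = sum(l.num_params for l in layers)
    low = sum(l.num_params for l in layers if l.low_precision)
    return low / total

# Toy model: keep the embedding in high precision, quantize the matmuls.
model = [
    LayerSpec("embed", 50_000, low_precision=False),
    LayerSpec("attn.matmul", 30_000, low_precision=True),
    LayerSpec("mlp.matmul", 20_000, low_precision=True),
]
print(quantization_ratio(model))  # 0.5
```

A finer granularity (matmul-wise rather than layer-wise) simply means the `low_precision` flag is set per operator instead of per layer, which—per the abstract—lets larger models push this ratio higher at the same accuracy.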