🤖 AI Summary
Existing post-training quantization methods minimize reconstruction error on limited or non-representative calibration data, which can push quantized weights far from their original values and degrade generalization. This work proposes a unified framework that adds a saliency-aware regularizer to the standard quantization objective, integrating weight saliency into the calibration process for the first time. The regularizer steers quantized weights toward the original weights without introducing additional inference overhead. The approach integrates with both scale search and Gram matrix optimization strategies, and consistently reduces perplexity and improves zero-shot accuracy on both dense and mixture-of-experts large language models.
📝 Abstract
Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, usually optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, calibration objectives that rely only on the empirical reconstruction error over limited or unrepresentative calibration data can move the quantized weights away from the original weights. This may cause the generalization risk to diverge, potentially degrading downstream performance. To address this issue, we propose Saliency-Aware Regularized Quantization Calibration (SARQC), a unified framework that augments the standard PTQ objective with a saliency-aware regularization term. This term encourages quantized weights to stay close to the original weights during calibration, leading to improved generalization at inference time. SARQC integrates seamlessly into existing PTQ pipelines, enhancing both scale search and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without additional computational overhead during inference.
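The idea of augmenting a reconstruction objective with a saliency-weighted penalty can be illustrated with a minimal sketch of a scale search. This is not the paper's implementation: the abstract does not give the exact objective, so the loss form (mean squared reconstruction error plus `lam` times a saliency-weighted squared weight distance), the function names `quantize` and `regularized_scale_search`, the grid of clipping ratios, and the choice of saliency are all assumptions made for illustration.

```python
import numpy as np

def quantize(w, scale, n_bits=4):
    """Symmetric round-to-nearest quantization with a per-row scale."""
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def regularized_scale_search(w, x, saliency, lam=0.1, n_bits=4, n_grid=20):
    """Per-row clipping-ratio search (hypothetical sketch).

    w:        (out, in) original weights
    x:        (n, in) calibration activations
    saliency: (out, in) non-negative importance scores (assumed given)
    Returns the quantized weights and the per-row loss that selected them.
    """
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.abs(w).max(axis=1, keepdims=True) + 1e-12
    best_loss = np.full(w.shape[0], np.inf)
    best_wq = np.zeros_like(w)
    for ratio in np.linspace(0.5, 1.0, n_grid):
        scale = ratio * max_abs / qmax
        wq = quantize(w, scale, n_bits)
        diff = w - wq
        # Empirical reconstruction error on the calibration data, per output row.
        recon = ((x @ diff.T) ** 2).mean(axis=0)
        # Assumed regularizer: saliency-weighted distance to the original weights.
        reg = (saliency * diff ** 2).sum(axis=1)
        loss = recon + lam * reg
        better = loss < best_loss
        best_loss = np.where(better, loss, best_loss)
        best_wq[better] = wq[better]
    return best_wq, best_loss
```

Setting `lam=0` recovers a plain reconstruction-only search, which makes it easy to compare the two objectives on the same layer. The saliency matrix is left to the caller; squared weight magnitude (`w ** 2`) is one placeholder choice, not something the abstract specifies. Because the penalty only affects how the quantization parameters are chosen during calibration, the resulting weights have the same format and inference cost as in the unregularized case.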