Identifying Sensitive Weights via Post-quantization Integral

πŸ“… 2025-02-28
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing gradient- and Hessian-based sensitivity metrics in LLM post-training quantization severely underestimate quantization-induced loss degradation because the local second-order approximation has an excessively small convergence radius. Method: We propose PQI (Post-quantization Integral), the first high-fidelity sensitivity metric for post-quantization, which models the true impact of quantization perturbations via integral calculus. Leveraging PQI, we further introduce ReQuant, a framework built on two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise detaching of significant weights. Contribution/Results: Applied on top of QTIP, ReQuant reduces perplexity on Llama 3.2 1B by 2.66, significantly enhancing both the robustness and the accuracy of post-training quantization.
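The paper's exact PQI formulation is not reproduced on this page, but the core idea, estimating the true loss change by integrating the gradient along the path from the original to the quantized weights instead of truncating Taylor's formula at second order, can be illustrated in a short PyTorch sketch. Everything below (the function names, the midpoint-rule discretization, the toy `loss_fn`) is an assumption for illustration, not the authors' implementation:

```python
import torch

def taylor_estimate(loss_fn, w, w_q):
    """Local 2nd-order Taylor estimate g.dw + 0.5*dw'H dw: the style of
    metric the paper finds underestimates the true loss degradation."""
    dw = (w_q - w).detach()
    w = w.detach().clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(loss_fn(w), w, create_graph=True)
    # Hessian-vector product H @ dw via double backprop.
    (hvp,) = torch.autograd.grad((grad * dw).sum(), w)
    return ((grad.detach() * dw).sum() + 0.5 * (dw * hvp).sum()).item()

def integral_estimate(loss_fn, w, w_q, steps=8):
    """Integral-style estimate dL = int_0^1 grad L(w + t*dw) . dw dt,
    discretized with a midpoint rule. By the fundamental theorem of
    calculus this is exact as steps -> inf, so it does not depend on the
    small convergence radius of the local Taylor expansion."""
    dw = (w_q - w).detach()
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        wt = (w + t * dw).detach().clone().requires_grad_(True)
        (grad,) = torch.autograd.grad(loss_fn(wt), wt)
        # The per-element integrand grad * dw (before summing) would give
        # a fine-grained, element-wise sensitivity map.
        total += (grad * dw).sum().item() / steps
    return total

# Toy check on a quartic loss, where a 2nd-order expansion breaks down.
w = torch.randn(64)
w_q = w + 0.3 * torch.randn(64)          # stand-in for quantization error
loss_fn = lambda x: (x ** 4).sum()
true_delta = (loss_fn(w_q) - loss_fn(w)).item()
print(taylor_estimate(loss_fn, w, w_q), integral_estimate(loss_fn, w, w_q), true_delta)
```

On losses with strong higher-order curvature, the integral estimate tracks `true_delta` closely while the Taylor estimate drifts, which mirrors the underestimation the paper reports for gradient- and Hessian-based metrics.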

πŸ“ Abstract
Serving Large Language Models (LLMs) is costly. Post-training weight quantization addresses this problem by compressing model size to fit limited memory and by saving bandwidth for acceleration. Since not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise detaching of significant weights. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 in perplexity on Llama 3.2 1B with QTIP.
Problem

Research questions and friction points this paper is trying to address.

Accurate sensitivity metric for weight quantization
Improving post-training quantization for large language models
Enhancing memory and bandwidth efficiency in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-quantization Integral for accurate sensitivity estimation
ReQuant framework with Dense-and-Sparse detach components
Self-adaptive outlier selection and step-wise detaching of significant weights (see the sketch below)
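The Dense-and-Sparse detach idea can be illustrated with a minimal PyTorch sketch: salient weights are pulled out into a small high-precision sparse matrix while the dense remainder is quantized at low bit-width. The top-magnitude selection rule and the uniform quantizer below are hypothetical placeholders; ReQuant instead selects outliers self-adaptively from its sensitivity metric, and the paper's dense path builds on QTIP:

```python
import torch

def dense_sparse_detach(w: torch.Tensor, sparsity: float = 0.005, bits: int = 4):
    """Split w into (low-bit dense part) + (high-precision sparse part of
    detached salient weights), so that w ~= dense_q + sparse."""
    k = max(1, int(sparsity * w.numel()))
    # Hypothetical saliency rule: keep the largest-magnitude weights.
    # ReQuant chooses the outliers adaptively from the PQI metric instead.
    idx = w.abs().flatten().topk(k).indices
    mask = torch.zeros(w.numel(), dtype=torch.bool, device=w.device)
    mask[idx] = True
    mask = mask.view_as(w)

    sparse = torch.where(mask, w, torch.zeros_like(w)).to_sparse()
    dense = torch.where(mask, torch.zeros_like(w), w)

    # Placeholder quantizer: symmetric uniform rounding. The paper uses
    # QTIP as the underlying low-bit quantizer for the dense part.
    qmax = 2 ** (bits - 1) - 1
    scale = dense.abs().max().clamp(min=1e-8) / qmax
    dense_q = (dense / scale).round().clamp(-qmax, qmax) * scale
    return dense_q, sparse

w = torch.randn(4096, 4096)
dense_q, sparse = dense_sparse_detach(w)
w_hat = dense_q + sparse.to_dense()   # reconstruction used at inference
```

Keeping the sparse part in full precision means the few weights that dominate the loss degradation never pass through the quantizer, at the cost of storing a small sparse matrix alongside the low-bit dense one.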
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yuezhou Hu
Tsinghua University

Weiyu Huang
Tsinghua University
Efficient ML

Zichen Liang
Nankai University
Computer Vision, Embodied AI, MLLMs

Chang Chen
Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University

Jintao Zhang
Tsinghua University
Efficient ML, MLSys, System for AI, Machine Learning, Database

Jun Zhu
Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University

Jianfei Chen
Associate Professor, Tsinghua University
Machine Learning