Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

📅 2025-09-05
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the trade-off between accuracy degradation and computational overhead in post-training quantization, this paper proposes an efficient quantization method grounded in parameter sensitivity analysis. Our approach integrates column-wise sensitivity clustering with a row-parallel quantization framework, coupled with a globally shared inverse Hessian matrix update mechanism to enable error compensation and low-complexity optimization—without iterative parameter updates. This design significantly mitigates accuracy loss under high compression ratios. Experiments on ResNet-50 and YOLOv5s demonstrate that our method achieves 20–200× faster quantization than Optimal Brain Quantization (OBQ), with average accuracy loss below 0.3%. The core innovations lie in a sensitivity-aware compensation mechanism and a scalable parallel quantization architecture, effectively balancing efficiency and accuracy for edge deployment.

📝 Abstract
Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method's efficacy in balancing efficiency and accuracy.
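The compensation idea in the abstract (quantize high-sensitivity parameters first, absorb each rounding error into the still-unquantized low-sensitivity parameters) can be sketched in OBQ/OBS style. This is not the paper's code: the uniform grid, the `scale` parameter, and the function names are illustrative assumptions, and the saliency is the standard Optimal-Brain-Surgeon form.

```python
import numpy as np

def quantize(x, scale):
    # Uniform quantization grid (an assumption; the paper does not fix the scheme).
    return np.round(x / scale) * scale

def sensitivity_aware_row(w, H_inv, scale=0.1):
    """Quantize one weight row, most-sensitive parameters first, spreading
    each rounding error onto the still-unquantized parameters via the
    inverse Hessian (OBQ-style error compensation, sketched)."""
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    # OBS-style saliency: squared rounding error weighted by 1 / [H^-1]_ii.
    saliency = (w - quantize(w, scale)) ** 2 / np.diag(H_inv)
    remaining = np.ones(len(w), dtype=bool)
    for i in np.argsort(-saliency):                    # high sensitivity first
        q[i] = quantize(w[i], scale)
        remaining[i] = False
        err = (w[i] - q[i]) / H_inv[i, i]
        w[remaining] -= err * H_inv[i, remaining]      # compensate the rest
    return q
```

With an identity inverse Hessian the compensation term vanishes and the routine reduces to plain rounding, which makes the role of the off-diagonal Hessian entries easy to see.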
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in neural network quantization
Mitigates accuracy loss from high compression ratio quantization
Enables efficient quantization for edge computing scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensitivity-aware quantization prioritizes high-sensitivity parameters
Row-parallel framework with shared Hessian reduces complexity
Unquantized low-sensitivity parameters compensate for quantization errors
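The second bullet, row parallelism with a shared Hessian, follows from the fact that the Hessian of the layer-wise reconstruction loss depends only on the layer inputs, so every output row shares it. A minimal GPTQ-like sketch (function name, `scale`, and the omission of the per-column inverse-Hessian downdate are all simplifying assumptions, not the paper's method):

```python
import numpy as np

def row_parallel_quantize(W, H_inv, scale=0.1):
    """Quantize all rows of W in parallel, one column at a time.

    All rows share the same inverse Hessian, so a single sequence of
    compensation steps serves every row at once; a full implementation
    would also downdate H_inv after each quantized column."""
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = np.round(W[:, j] / scale) * scale
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]          # one error per row
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])  # vectorized over rows
    return Q
```

The inner update touches every row with one `np.outer` call, which is where the order-of-magnitude complexity reduction over per-row OBQ comes from.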
Zekang Zheng
South China University of Technology, Guangzhou, China
Haokun Li
South China University of Technology, Guangzhou, China
Yaofo Chen
South China University of Technology
Large Language Models, AutoML, Model Adaptation, Robustness
Mingkui Tan
South China University of Technology
Machine Learning, Large-scale Optimization
Qing Du
South China University of Technology, Guangzhou, China