Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

📅 2025-09-05
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the trade-off between accuracy degradation and computational overhead in post-training quantization, this paper proposes an efficient quantization method grounded in parameter sensitivity analysis. Our approach integrates column-wise sensitivity clustering with a row-parallel quantization framework, coupled with a globally shared inverse Hessian matrix update mechanism to enable error compensation and low-complexity optimization—without iterative parameter updates. This design significantly mitigates accuracy loss under high compression ratios. Experiments on ResNet-50 and YOLOv5s demonstrate that our method achieves 20–200× faster quantization than Optimal Brain Quantization (OBQ), with average accuracy loss below 0.3%. The core innovations lie in a sensitivity-aware compensation mechanism and a scalable parallel quantization architecture, effectively balancing efficiency and accuracy for edge deployment.

📝 Abstract
Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method's efficacy in balancing efficiency and accuracy.
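The compensation idea in the abstract (quantize high-sensitivity parameters first, absorb each rounding error into the still-unquantized low-sensitivity parameters) can be sketched in OBQ/OBS style. This is not the paper's code: the uniform grid, the `scale` parameter, and the function names are illustrative assumptions, and the saliency is the standard Optimal-Brain-Surgeon form.

```python
import numpy as np

def quantize(x, scale):
    # Uniform quantization grid (an assumption; the paper does not fix the scheme).
    return np.round(x / scale) * scale

def sensitivity_aware_row(w, H_inv, scale=0.1):
    """Quantize one weight row, most-sensitive parameters first, spreading
    each rounding error onto the still-unquantized parameters via the
    inverse Hessian (OBQ-style error compensation, sketched)."""
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    # OBS-style saliency: squared rounding error weighted by 1 / [H^-1]_ii.
    saliency = (w - quantize(w, scale)) ** 2 / np.diag(H_inv)
    remaining = np.ones(len(w), dtype=bool)
    for i in np.argsort(-saliency):                    # high sensitivity first
        q[i] = quantize(w[i], scale)
        remaining[i] = False
        err = (w[i] - q[i]) / H_inv[i, i]
        w[remaining] -= err * H_inv[i, remaining]      # compensate the rest
    return q
```

With an identity inverse Hessian the compensation term vanishes and the routine reduces to plain rounding, which makes the role of the off-diagonal Hessian entries easy to see.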
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in neural network quantization
Mitigates accuracy loss from high compression ratio quantization
Enables efficient quantization for edge computing scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensitivity-aware quantization prioritizes high-sensitivity parameters
Row-parallel framework with shared Hessian reduces complexity
Unquantized low-sensitivity parameters compensate for quantization errors
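The second bullet, row parallelism with a shared Hessian, follows from the fact that the Hessian of the layer-wise reconstruction loss depends only on the layer inputs, so every output row shares it. A minimal GPTQ-like sketch (function name, `scale`, and the omission of the per-column inverse-Hessian downdate are all simplifying assumptions, not the paper's method):

```python
import numpy as np

def row_parallel_quantize(W, H_inv, scale=0.1):
    """Quantize all rows of W in parallel, one column at a time.

    All rows share the same inverse Hessian, so a single sequence of
    compensation steps serves every row at once; a full implementation
    would also downdate H_inv after each quantized column."""
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = np.round(W[:, j] / scale) * scale
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]          # one error per row
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])  # vectorized over rows
    return Q
```

The inner update touches every row with one `np.outer` call, which is where the order-of-magnitude complexity reduction over per-row OBQ comes from.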
Zekang Zheng
South China University of Technology, Guangzhou, China
Haokun Li
South China University of Technology, Guangzhou, China
Yaofo Chen
South China University of Technology
Large Language Models, AutoML, Model Adaptation, Robustness
Mingkui Tan
South China University of Technology
Machine Learning, Large-scale Optimization
Qing Du
South China University of Technology, Guangzhou, China