🤖 AI Summary
Addressing the challenge of balancing input adaptivity against memory overhead in dynamic neural network quantization, this paper proposes a sample-wise dynamic rescaling method based on probabilistic modeling. A lightweight proxy network models the pre-activation distribution to estimate optimal quantization parameters in real time, eliminating the need to explicitly store statistics or repeatedly recompute them. The work is the first to integrate probabilistic modeling into a dynamic quantization framework, enabling input-adaptive quantization without additional GPU memory consumption. Evaluated on mainstream vision models (including ResNet and ViT) and tasks such as ImageNet classification and COCO object detection, the method incurs negligible accuracy degradation (<0.3% Top-1 drop) while significantly reducing computational overhead compared to existing dynamic quantization approaches, achieving a superior accuracy-efficiency trade-off relative to both post-training quantization (PTQ) and quantization-aware training (QAT) baselines.
📝 Abstract
We propose a probabilistic framework for dynamic quantization of neural networks that enables computationally efficient, input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, adaptively adjusting the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes a better trade-off between performance and computational overhead than standard quantization strategies.
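The per-input rescaling idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear-on-statistics proxy, its weights, and the symmetric uniform quantizer are all assumptions standing in for the paper's probabilistic pre-activation model.

```python
import numpy as np

def proxy_scale(x_stats, w, b):
    # Hypothetical lightweight proxy: predicts a per-sample quantization
    # scale from cheap summary statistics of the input (here mean and std),
    # so no activation histograms need to be stored or recomputed.
    return float(np.exp(w @ x_stats + b))  # exp keeps the scale positive

def quantize_dequantize(x, scale, bits=8):
    # Symmetric uniform quantization with an input-adaptive scale.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1024)       # one sample's pre-activations
stats = np.array([x.mean(), x.std()])     # cheap per-sample statistics
# Illustrative proxy weights chosen so ~3 standard deviations fit the grid.
scale = proxy_scale(stats, w=np.array([0.0, 0.02]), b=np.log(2.0 * 3 / 127))
x_hat = quantize_dequantize(x, scale)
err = np.mean((x - x_hat) ** 2)           # quantization MSE for this sample
```

Because the scale is recomputed from each sample's statistics, a sample with a wider pre-activation distribution automatically gets a coarser grid, which is the input-adaptive behavior the framework targets.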