InfoQ: Mixed-Precision Quantization via Global Information Flow

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Bit allocation for mixed-precision quantization of deep neural networks on resource-constrained devices is a computationally intractable combinatorial optimization problem, and existing methods fail to capture global error propagation because they rely on local sensitivity metrics (e.g., the Hessian) or prohibitively expensive search strategies. Method: We propose a training-free, global information-flow-aware quantization framework that measures each layer's impact on end-to-end information flow via the change in mutual information induced by quantization, computed in a single forward pass — replacing local sensitivity with a global, information-theoretic proxy — and formulates bit allocation as an integer linear program. Results: On ImageNet, our method achieves up to a 1% top-1 accuracy gain at 14× and 10.66× model compression for MobileNetV2 and ResNet18, respectively, while reducing search overhead by two orders of magnitude compared to state-of-the-art approaches.
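The summary's last step — choosing per-layer bit-widths to minimize total sensitivity under a size budget — can be sketched as follows. The sensitivity scores and parameter counts below are illustrative, not from the paper, and for self-containment the tiny instance is solved by exhaustive search rather than an ILP solver (the paper formulates it as an integer linear program):

```python
from itertools import product

# Hypothetical per-layer sensitivity scores s[layer][bits]: the cost of
# quantizing that layer to the given bit-width (lower bits -> higher cost).
# These numbers are illustrative, not taken from the paper.
sensitivity = {
    0: {2: 0.90, 4: 0.30, 8: 0.05},
    1: {2: 0.60, 4: 0.20, 8: 0.02},
    2: {2: 0.40, 4: 0.10, 8: 0.01},
}
# Parameter count per layer (illustrative); model size = sum(params * bits).
params = {0: 1000, 1: 2000, 2: 500}
budget_bits = 4 * sum(params.values())  # e.g. an average of 4 bits/weight

def allocate_bits(sensitivity, params, budget_bits):
    """Minimize total sensitivity subject to a model-size budget.

    Exhaustive search stand-in for the ILP solve described in the paper;
    only practical for toy instances like this one."""
    layers = sorted(sensitivity)
    choices = [sorted(sensitivity[l]) for l in layers]
    best, best_cost = None, float("inf")
    for assign in product(*choices):
        size = sum(params[l] * b for l, b in zip(layers, assign))
        if size > budget_bits:
            continue  # violates the compression budget
        cost = sum(sensitivity[l][b] for l, b in zip(layers, assign))
        if cost < best_cost:
            best, best_cost = dict(zip(layers, assign)), cost
    return best, best_cost

alloc, cost = allocate_bits(sensitivity, params, budget_bits)
print(alloc, cost)
```

In a real instance with dozens of layers and several candidate bit-widths, the same objective and budget constraint would be handed to an ILP solver, which the abstract notes is solved efficiently.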

📝 Abstract
Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer is a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or local sensitivity heuristics such as the Hessian, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer's quantization impacts the network's information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data than state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14× and 10.66×).
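The sensitivity measurement described above — quantize one layer, run a single forward pass, and score the change in mutual information downstream — can be illustrated on a toy scale. Everything below is a hypothetical stand-in: a tiny hand-wired two-layer network, per-tensor uniform quantization, and a simple histogram-based mutual-information estimate between the input and the output, none of which are the paper's actual architecture or estimator:

```python
import math
from collections import Counter

def quantize(w, bits):
    """Per-tensor uniform quantization of a weight list to 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if levels else 1.0
    return [lo + round((v - lo) / scale) * scale for v in w]

def mutual_info(xs, zs, bins=8):
    """Histogram estimate of I(X; Z) in bits (a crude MI proxy)."""
    def bin_idx(vs):
        lo, hi = min(vs), max(vs)
        span = (hi - lo) or 1.0
        return [min(bins - 1, int((v - lo) / span * bins)) for v in vs]
    ix, iz = bin_idx(xs), bin_idx(zs)
    n = len(xs)
    pxy, px, pz = Counter(zip(ix, iz)), Counter(ix), Counter(iz)
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(x,z) * log( p(x,z) / (p(x) p(z)) ), with counts c, px[a], pz[b]
        mi += (c / n) * math.log(c * n / (px[a] * pz[b]))
    return mi / math.log(2)

def forward(x, w1, w2):
    """Toy 2-layer network: hidden tanh units, linear readout."""
    hidden = [math.tanh(w * x) for w in w1]
    return sum(a * h for a, h in zip(w2, hidden))

def layer_sensitivity(xs, w1, w2, bits):
    """|change in I(X; Z)| when the first layer is quantized to `bits`."""
    z_fp = [forward(x, w1, w2) for x in xs]
    z_q = [forward(x, quantize(w1, bits), w2) for x in xs]
    return abs(mutual_info(xs, z_fp) - mutual_info(xs, z_q))

# Illustrative inputs and weights (not from the paper).
xs = [i / 50 - 1 for i in range(101)]
w1 = [0.7, -1.3, 2.1, 0.4]
w2 = [1.0, 0.5, -0.8, 1.2]
for bits in (2, 4, 8):
    print(bits, layer_sensitivity(xs, w1, w2, bits))
```

Repeating this per layer and per candidate bit-width yields the score table that the ILP in the abstract consumes; the key property being sketched is that each score needs only one forward pass and no retraining.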
Problem

Research questions and friction points this paper is trying to address.

Optimize bit-width allocation for efficient neural network deployment
Measure layer sensitivity via global information flow impact
Achieve high compression rates with minimal accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free bit-width search via mutual information
Integer linear programming for bit-width allocation
Single forward pass quantifies layer sensitivity