AI Summary
Existing low-precision quantization methods for sub-microsecond neural network inference on FPGAs often incur substantial accuracy degradation, while conventional mixed-precision approaches suffer from coarse granularity and inflexible optimization.
Method: This paper proposes a gradient-based fine-grained mixed-precision quantization method. It is the first to model bitwidth as a learnable parameter embedded within a quantization-aware training framework, enabling layer-wise independent and differentiable bitwidth assignment for weights and activations. The method jointly optimizes bitwidth configurations under FPGA/ASIC hardware constraints.
Contribution/Results: Experiments demonstrate that, without sacrificing model accuracy, the approach reduces hardware resource consumption to 5% of the baseline and cuts end-to-end inference latency to 20%, a fivefold speedup. This significantly improves the deployment efficiency of on-chip neural networks for ultra-low-latency applications.
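The core idea of treating bitwidth as a learnable parameter can be sketched in a toy one-parameter form. This is not the paper's implementation: the real task loss is replaced here by the classic uniform-quantization noise model (MSE ≈ Δ²/12 with step Δ = 2^(1−b) for a value in [−1, 1)), and a linear resource penalty λ·b stands in for the FPGA/ASIC hardware constraint. Gradient descent on the continuous bitwidth b then settles where extra bits no longer pay for themselves:

```python
import numpy as np

# Toy sketch (assumed setup, not HGQ's actual code): one fixed-point value in
# [-1, 1) with a continuous, learnable bitwidth b. Quantization noise follows
# the standard model MSE ~= delta^2 / 12 with step delta = 2^(1 - b), so
# MSE(b) = 4^(1 - b) / 12; lam * b is a per-bit resource penalty.

def loss_grad(b, lam):
    mse_grad = -np.log(4.0) * 4.0 ** (1.0 - b) / 12.0  # d/db of 4^(1-b)/12
    return mse_grad + lam                               # d/db of lam * b

b, lam, lr = 12.0, 1e-4, 5e3
for _ in range(200):
    # Plain gradient descent; b stays in a valid hardware range [1, 16].
    b = float(np.clip(b - lr * loss_grad(b, lam), 1.0, 16.0))

# b settles where cheaper hardware (fewer bits) balances quantization noise.
print(f"learned bitwidth ~ {b:.2f}")
```

HGQ applies this kind of gradient signal not to one scalar but to every weight and activation independently, with the penalty derived from actual hardware resource estimates.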
Abstract
Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy for overcoming them is quantization; however, straightforward uniform quantization to very low precision can cause significant accuracy loss. Mixed-precision quantization, based on the observation that some parts of a network tolerate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), a quantization-aware training method that fine-tunes per-weight and per-activation precision by making bitwidths optimizable through gradient descent. This enables ultra-low-latency, low-power neural networks on hardware capable of performing arithmetic with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ outperforms existing methods by a substantial margin, reducing resource usage by up to a factor of 20 and latency by a factor of 5 while preserving accuracy.
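The accuracy-versus-precision trade-off the abstract describes is easy to see numerically. The sketch below (illustrative only; the weight values and fixed-point scheme are assumptions, not taken from the paper) fake-quantizes a handful of weights to a signed fixed-point grid and measures the error at a few bitwidths:

```python
import numpy as np

def quantize(w, bits):
    """Fake-quantize to a signed fixed-point grid with 2**bits levels in [-1, 1)."""
    scale = 2.0 ** (bits - 1)                      # one bit carries the sign
    return np.clip(np.round(w * scale), -scale, scale - 1) / scale

w = np.array([0.13, -0.72, 0.05, 0.91])            # made-up example "weights"
errs = {b: float(np.mean((w - quantize(w, b)) ** 2)) for b in (2, 4, 8)}
# Fewer bits -> cheaper FPGA/ASIC arithmetic but larger quantization error;
# HGQ lets gradient descent pick this trade-off per weight and per activation.
print(errs)
```

Uniform low precision pays the worst-case error everywhere; HGQ's per-weight, per-activation granularity spends bits only where the error actually matters.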