AI Summary
Existing low-precision quantization methods for sub-microsecond neural network inference on FPGAs often incur substantial accuracy degradation, while conventional mixed-precision approaches suffer from coarse granularity and inflexible optimization.
Method: This paper proposes a gradient-based fine-grained mixed-precision quantization method. It is the first to model bitwidth as a learnable parameter embedded within a quantization-aware training framework, enabling layer-wise independent and differentiable bitwidth assignment for weights and activations. The method jointly optimizes bitwidth configurations under FPGA/ASIC hardware constraints.
Contribution/Results: Experiments demonstrate that, without sacrificing model accuracy, the approach reduces hardware resource consumption to 5% of the baseline and cuts end-to-end inference latency to 20%, a fivefold speedup. This significantly improves the deployment efficiency of on-chip neural networks for ultra-low-latency applications.
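The core idea of treating bitwidth as a learnable parameter can be sketched in a toy one-parameter form. This is not the paper's implementation: the real task loss is replaced here by the classic uniform-quantization noise model (MSE ≈ Δ²/12 with step Δ = 2^(1−b) for a value in [−1, 1)), and a linear resource penalty λ·b stands in for the FPGA/ASIC hardware constraint. Gradient descent on the continuous bitwidth b then settles where extra bits no longer pay for themselves:

```python
import numpy as np

# Toy sketch (assumed setup, not HGQ's actual code): one fixed-point value in
# [-1, 1) with a continuous, learnable bitwidth b. Quantization noise follows
# the standard model MSE ~= delta^2 / 12 with step delta = 2^(1 - b), so
# MSE(b) = 4^(1 - b) / 12; lam * b is a per-bit resource penalty.

def loss_grad(b, lam):
    mse_grad = -np.log(4.0) * 4.0 ** (1.0 - b) / 12.0  # d/db of 4^(1-b)/12
    return mse_grad + lam                               # d/db of lam * b

b, lam, lr = 12.0, 1e-4, 5e3
for _ in range(200):
    # Plain gradient descent; b stays in a valid hardware range [1, 16].
    b = float(np.clip(b - lr * loss_grad(b, lam), 1.0, 16.0))

# b settles where cheaper hardware (fewer bits) balances quantization noise.
print(f"learned bitwidth ~ {b:.2f}")
```

HGQ applies this kind of gradient signal not to one scalar but to every weight and activation independently, with the penalty derived from actual hardware resource estimates.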
Abstract
Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy for overcoming them is quantization; however, straightforward uniform quantization to very low precision can cause significant accuracy loss. Mixed-precision quantization, based on the observation that some parts of a network tolerate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), a quantization-aware training method that fine-tunes per-weight and per-activation precision by making bitwidths optimizable through gradient descent. This enables ultra-low-latency, low-power neural networks on hardware capable of performing arithmetic with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ outperforms existing methods by a substantial margin, reducing resource usage by up to a factor of 20 and latency by a factor of 5 while preserving accuracy.
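The accuracy-versus-precision trade-off the abstract describes is easy to see numerically. The sketch below (illustrative only; the weight values and fixed-point scheme are assumptions, not taken from the paper) fake-quantizes a handful of weights to a signed fixed-point grid and measures the error at a few bitwidths:

```python
import numpy as np

def quantize(w, bits):
    """Fake-quantize to a signed fixed-point grid with 2**bits levels in [-1, 1)."""
    scale = 2.0 ** (bits - 1)                      # one bit carries the sign
    return np.clip(np.round(w * scale), -scale, scale - 1) / scale

w = np.array([0.13, -0.72, 0.05, 0.91])            # made-up example "weights"
errs = {b: float(np.mean((w - quantize(w, b)) ** 2)) for b in (2, 4, 8)}
# Fewer bits -> cheaper FPGA/ASIC arithmetic but larger quantization error;
# HGQ lets gradient descent pick this trade-off per weight and per activation.
print(errs)
```

Uniform low precision pays the worst-case error everywhere; HGQ's per-weight, per-activation granularity spends bits only where the error actually matters.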