🤖 AI Summary
This work presents PQuantML, an open-source, end-to-end, hardware-aware model compression library that unifies multi-granularity pruning and fixed-point quantization with support for High-Granularity Quantization (HGQ), enabling their joint or independent application through a single training interface. Designed for deploying efficient neural networks on edge hardware under stringent latency constraints, the framework substantially reduces both parameter count and bit-width while preserving accuracy. Evaluated on real-time edge-computing tasks, such as jet tagging at the Large Hadron Collider, the approach balances compression ratio and predictive fidelity, and its compression results are benchmarked against existing tools such as QKeras and HGQ.
📝 Abstract
PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models in environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as jet substructure classification, so-called jet tagging, an edge-computing problem arising in real-time LHC data processing. Using various pruning methods combined with fixed-point quantization, PQuantML achieves substantial reductions in parameter count and bit-width while maintaining accuracy. The resulting compression is further compared against existing tools, such as QKeras and HGQ.
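To make the two compression operations described above concrete, the sketch below illustrates the generic techniques in isolation: magnitude-based pruning and signed fixed-point (quantized) weight rounding applied jointly to a weight matrix. This is a minimal NumPy illustration of the underlying math only, assuming nothing about PQuantML's actual API; the function names (`quantize_fixed_point`, `magnitude_prune`) are hypothetical and chosen for this example.

```python
import numpy as np

def quantize_fixed_point(x, total_bits=8, int_bits=2):
    """Simulate signed fixed-point quantization.

    total_bits includes the sign bit; frac_bits = total_bits - int_bits.
    Values are rounded to the nearest representable step and clipped to
    the signed range [-2^(total_bits-1), 2^(total_bits-1) - 1] * step.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** (-frac_bits)
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x / step), qmin, qmax) * step

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    threshold = np.percentile(np.abs(w), sparsity * 100.0)
    mask = np.abs(w) >= threshold
    return w * mask

# Joint application: prune first, then quantize the surviving weights,
# mirroring the combined pruning + fixed-point flow described above.
w = np.array([0.01, 0.5, -0.02, 1.0, 1.99, -3.0])
w_pruned = magnitude_prune(w, sparsity=0.5)
w_compressed = quantize_fixed_point(w_pruned, total_bits=8, int_bits=2)
print(w_compressed)
```

Note the clipping behavior: with 2 integer bits the representable range is roughly [-2.0, 2.0), so the weight -3.0 saturates to -2.0 rather than overflowing, which is the standard behavior for hardware-oriented fixed-point arithmetic.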