PQuantML: A Tool for End-to-End Hardware-aware Model Compression

📅 2026-03-27
🤖 AI Summary
This work presents an open-source, end-to-end, hardware-aware model compression library that, for the first time, unifies multi-granularity pruning and fixed-point quantization with High-Granularity Quantization (HGQ) support, enabling their joint or independent application through a consistent training interface. Designed for efficient neural network deployment on edge hardware under stringent latency constraints, the framework significantly reduces both model size and bit-width while preserving high accuracy. Evaluated in real-time edge-computing scenarios, such as jet tagging at the Large Hadron Collider, the approach demonstrates compression performance superior to that of existing tools such as QKeras and HGQ, effectively balancing compression ratio and predictive fidelity.
📝 Abstract
PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models in environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods at different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as jet substructure classification (so-called jet tagging), an edge-computing problem arising in real-time LHC data processing. Using various pruning methods combined with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools such as QKeras and HGQ.
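The two compression transformations named in the abstract can be illustrated with a small, self-contained sketch. Note that this is not PQuantML's actual API: the function names `magnitude_prune` and `fixed_point_quantize` are hypothetical, and the sketch shows only the two post-training transformations in isolation (unstructured magnitude pruning followed by signed fixed-point rounding), not the library's unified training interface or its multi-granularity variants.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # Threshold at the k-th smallest absolute value; ties may prune slightly more.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def fixed_point_quantize(w, total_bits=8, int_bits=1):
    """Round to a signed fixed-point grid with `total_bits` bits overall
    (sign bit included in `int_bits`) and `total_bits - int_bits` fractional bits."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -2.0 ** (int_bits - 1)               # most negative representable value
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale  # most positive representable value
    return np.clip(np.round(w * scale) / scale, lo, hi)

w = np.array([0.3, -0.05, 1.7, -2.0])
w_pruned = magnitude_prune(w, sparsity=0.25)  # the smallest weight, -0.05, is zeroed
w_q = fixed_point_quantize(w_pruned, total_bits=8, int_bits=1)
# Survivors are rounded to 1/128 steps; values outside [-1, 0.9921875] saturate.
```

Applying the two steps in sequence, as above, mirrors the joint pruning-plus-quantization workflow the abstract describes; in practice, compression-aware training would interleave these operations with gradient updates rather than apply them once post hoc.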
Problem

Research questions and friction points this paper is trying to address.

model compression
hardware-aware deployment
pruning
quantization
latency constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

hardware-aware compression
end-to-end model compression
pruning and quantization
high-granularity quantization
fixed-point quantization
Authors

Roope Niemi
European Organization for Nuclear Research (CERN), CH-1211 Geneva, Switzerland
Anastasiia Petrovych
European Organization for Nuclear Research (CERN), CH-1211 Geneva, Switzerland
Arghya Ranjan Das
Purdue University, West Lafayette, IN 47907, USA
Enrico Lupi
European Organization for Nuclear Research (CERN), CH-1211 Geneva, Switzerland
Chang Sun
California Institute of Technology, Pasadena, CA 91125, USA
Dimitrios Danopoulos
CERN
Marlon Joshua Helbing
University of Padova, Italy
Mia Liu
Purdue University
Sebastian Dittmeier
Physikalisches Institut, Heidelberg University, Germany
Michael Kagan
SLAC National Accelerator Laboratory, Menlo Park, USA
Vladimir Loncar
CERN
Maurizio Pierini
CERN