🤖 AI Summary
To address the severe accuracy degradation and hardware-specific dependencies of ultra-low-bit (1–4 bit) quantization, this paper proposes SONIQ, a system-coordinated noise-injection quantization training framework. During training, hardware-calibrated discrete noise is injected so that the model adapts to the quantized arithmetic of the target deployment platform. An end-to-end differentiable per-channel dynamic mixed-precision mechanism is introduced; only two precision tiers (one in the 1–4-bit range, one in the 4–8-bit range) are needed to achieve strong accuracy–efficiency trade-offs. The framework enables INT1–INT4 inference on commodity CPUs/GPUs without custom runtimes. It achieves up to 16× and 7× model compression on CNNs and Transformers, respectively; delivers up to 7.3× CPU speedup over strong INT8 baselines and up to 6.3× (vector units) / 2.8× (tensor cores) GPU speedup over FP16; and matches or exceeds full-precision accuracy on key tasks, making it, to the authors' knowledge, the first framework to reach or surpass full-precision accuracy under ultra-low-bit (1–4 bits per parameter) quantization on commodity hardware.
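The core training trick described above, exposing the network to the same discretization it will face at deployment, is commonly realized as per-channel "fake quantization": weights are rounded to a low-bit grid and mapped back to floats, so the forward pass sees exactly the quantization noise of the target integer format. The sketch below is a minimal illustrative version in numpy; the function name and the symmetric per-channel scheme are assumptions for illustration, not SONIQ's actual implementation.

```python
import numpy as np

def fake_quantize(w, bits, axis=0):
    """Per-channel symmetric fake quantization: round to a b-bit integer
    grid, then map back to floats so training sees the rounding noise."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    # one scale per output channel, reduced over all other axes
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    scale = np.max(np.abs(w), axis=reduce_axes, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized ("noisy") weights

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
w4 = fake_quantize(w, bits=4)
noise = w4 - w   # the injected quantization noise the optimizer must tolerate
```

In quantization-aware training this transform is applied in the forward pass while gradients flow through the rounding via a straight-through estimator; the per-channel rounding error above is bounded by half a quantization step.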
📝 Abstract
Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment -- without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels -- one in the 1--4-bit range and one in the 4--8-bit range -- suffice; at inference, each channel selects one of the two, keeping kernels simple and fast. To our knowledge, SONIQ is the first framework to reach or surpass full-precision accuracy under ultra-low (1--4 bits per parameter) regimes while remaining deployable on commodity hardware, narrowing the gap between quantization theory and practical, high-throughput inference.
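The two-tier outcome the abstract highlights can be sketched as follows: once training has assigned each channel to the low or high tier, inference only needs a per-channel boolean mask selecting one of two quantizers. The numpy sketch below is illustrative; the function names, the 2-bit/8-bit tier choice, and the symmetric scheme are assumptions, not the paper's kernels.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric fake quantization of a 1-D channel to a b-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def two_tier_quantize(w, channel_is_low, low_bits=2, high_bits=8):
    """Quantize each output channel (row) at one of exactly two precisions,
    chosen by a per-channel mask assumed to be learned during training."""
    out = np.empty_like(w)
    for c in range(w.shape[0]):
        bits = low_bits if channel_is_low[c] else high_bits
        out[c] = fake_quantize(w[c], bits)
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
mask = np.array([True, True, False, True])   # 3 low-bit channels, 1 high-bit
wq = two_tier_quantize(w, mask)
```

Restricting the search to exactly two tiers is what keeps the deployed kernels simple: each channel dispatches to one of two integer code paths rather than to an arbitrary bit-width per channel.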