SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity

📅 2023-11-23
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation and hardware-specific dependencies of ultra-low-bit (1–4 bit) quantization, this paper proposes SONIQ, a system-coordinated noise-injection quantization training framework. During training, hardware-calibrated discrete noise is injected so that quantized arithmetic matches the target deployment platform. An end-to-end differentiable per-channel dynamic mixed-precision mechanism is introduced, requiring only two precision tiers (1–4 bits and 4–8 bits) to achieve strong accuracy–efficiency trade-offs. The framework enables INT1–INT4 inference on commodity CPUs/GPUs without custom runtimes. It achieves 16× and 7× model compression on CNNs and Transformers, respectively; delivers up to 7.3× CPU speedup over INT8 and up to 6.3× GPU speedup over FP16 on vector units (2.8× on tensor cores); and matches or exceeds full-precision accuracy on key tasks, marking the first demonstration of accuracy *surpassing* full-precision baselines under ultra-low-bit quantization on commodity hardware.
📝 Abstract
Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment, without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16× and 7× compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3× CPU speedup over strong INT8 baselines and up to 6.3× (vector units) / 2.8× (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels, one in the 1–4-bit range and one in the 4–8-bit range, suffice in practice; at inference, each channel selects one of the two, keeping kernels simple and fast. To our knowledge, SONIQ is the first framework to reach or surpass full-precision accuracy under ultra-low (1–4 bits per parameter) regimes while remaining deployable on commodity hardware, narrowing the gap between quantization theory and practical, high-throughput inference.
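The noise-injection idea described above can be sketched as generic quantization-aware training in which additive noise stands in for rounding error during the forward pass. The function below is a minimal, hypothetical illustration of that idea, not SONIQ's actual hardware-calibrated procedure; all names and the uniform-noise model are assumptions.

```python
import numpy as np

def quant_noise(w, num_bits, rng):
    """Simulate quantization during training by adding noise whose
    magnitude matches the quantization step (a generic QAT stand-in,
    not SONIQ's hardware-calibrated noise)."""
    # Symmetric uniform quantizer: step size from the tensor's range.
    qmax = 2 ** (num_bits - 1) - 1
    step = np.abs(w).max() / qmax
    # Uniform noise in [-step/2, step/2] approximates rounding error.
    noise = rng.uniform(-step / 2, step / 2, size=w.shape)
    return w + noise

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w_noisy = quant_noise(w, num_bits=4, rng=rng)
```

Because the perturbed weights stay within half a quantization step of the originals, the network is trained to tolerate exactly the kind of error a real low-bit kernel would introduce at inference.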
Problem

Research questions and friction points this paper is trying to address.

Ultra-low-precision inference reduces memory but degrades accuracy
Specialized hardware requirements limit practical quantization deployment
Gap exists between quantization theory and high-throughput inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns per-channel mixed precision for weights and activations
Injects hardware-calibrated quantization noise during training
Uses two precision levels selectable per channel at inference
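The two-tier, per-channel selection in the last bullet can be sketched as follows. The selection rule here (thresholding each channel's dynamic range) is a heuristic illustration only; SONIQ learns the per-channel assignment end-to-end during training, and the bit-widths and threshold below are assumed values.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Round to a symmetric uniform grid, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    peak = np.abs(x).max()
    scale = peak / qmax if peak > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def two_tier_quantize(weights, low_bits=2, high_bits=8, threshold=0.5):
    """Per-channel two-tier quantization sketch: each channel picks one
    of exactly two bit-widths, keeping inference kernels simple."""
    out = np.empty_like(weights)
    for c in range(weights.shape[0]):  # one row = one output channel
        bits = low_bits if np.abs(weights[c]).max() < threshold else high_bits
        out[c] = fake_quantize(weights[c], bits)
    return out

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 16)).astype(np.float32)
w_q = two_tier_quantize(w)
```

Restricting every channel to one of two precisions is what keeps the deployed kernels simple: a kernel only ever dispatches between two code paths rather than one per possible bit-width.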
Cyrus Zhou
Department of Computer Science, Stanford University, CA, USA
Pedro H. P. Savarese
TTI-Chicago, Chicago, IL, USA
Zack Hassman
Department of Computer Science, University of Chicago, Chicago, IL, USA
Vaughn Richard
Department of Computer Science, University of Chicago, Chicago, IL, USA
Michael DiBrino
FutureWei Technologies, Austin, TX, USA
Michael Maire
University of Chicago
Yanjing Li
Department of Computer Science, University of Chicago, Chicago, IL, USA