🤖 AI Summary
In foundation model (FM) quantization, weight outliers make high accuracy and hardware efficiency hard to achieve simultaneously: mixed-precision schemes preserve accuracy at the cost of efficiency, while uniform low-bit quantization degrades accuracy. Method: MicroScopiQ introduces a pruning-assisted, outlier-aware quantization paradigm that makes no assumptions about outlier locality and is therefore applicable across diverse FMs. It uses pruning to remove a fraction of the least important weights, reallocating their bit budget so that outliers are retained at higher precision while regular weights remain aggressively compressed. Integrated with multi-precision INT processing elements and a custom network-on-chip (ReCoN), it jointly optimizes accuracy, memory footprint, and hardware efficiency. Results: Experiments demonstrate state-of-the-art accuracy across multiple quantization configurations, up to 3x faster inference, and 2x lower energy consumption than existing alternatives.
📝 Abstract
Quantization of foundation models (FMs) is significantly more challenging than that of traditional DNNs due to the emergence of large-magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a fraction of the least important weights to accommodate the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of multi-precision INT processing elements and a network-on-chip called ReCoN that efficiently abstracts away the complexity of supporting high-precision outliers. Additionally, unlike prior techniques, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across diverse quantization settings demonstrate that MicroScopiQ achieves state-of-the-art quantization accuracy, while delivering up to 3x faster inference and 2x lower energy consumption compared to existing alternatives. Code is available at: https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization
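The core idea, pruning the least important weights so their bit budget can pay for keeping outliers at higher precision, can be illustrated with a minimal NumPy sketch. This is a simplified toy (magnitude-based importance, per-tensor uniform quantization, and the function name `pruning_assisted_quantize` are our assumptions for illustration), not the paper's actual algorithm or bit-packing scheme:

```python
import numpy as np

def pruning_assisted_quantize(w, bits=4, outlier_bits=8, outlier_frac=0.01):
    """Toy sketch of pruning-assisted outlier-aware quantization:
    keep the top-|w| fraction (outliers) at higher precision, prune an
    equal fraction of the smallest-|w| weights to free bit budget, and
    quantize the remaining inliers uniformly at low precision."""
    flat = w.astype(np.float64).ravel().copy()
    n = flat.size
    k = max(1, int(outlier_frac * n))
    order = np.argsort(np.abs(flat))
    pruned_idx = order[:k]        # least important weights -> pruned
    outlier_idx = order[-k:]      # largest magnitudes -> outliers

    def uniform_q(x, b):
        # symmetric uniform quantizer with a per-group max-abs scale
        scale = np.max(np.abs(x)) / (2 ** (b - 1) - 1)
        return np.round(x / scale) * scale if scale > 0 else x

    q = flat.copy()
    q[pruned_idx] = 0.0           # pruning pays for the extra outlier bits
    inlier_mask = np.ones(n, dtype=bool)
    inlier_mask[pruned_idx] = False
    inlier_mask[outlier_idx] = False
    q[inlier_mask] = uniform_q(q[inlier_mask], bits)
    q[outlier_idx] = uniform_q(q[outlier_idx], outlier_bits)
    return q.reshape(w.shape)
```

In this sketch the per-element storage stays aligned: each pruned weight costs zero bits, offsetting the extra bits spent on an outlier, which is the memory-alignment property the co-design relies on.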