🤖 AI Summary
In foundation model (FM) quantization, weight outliers make high accuracy and hardware efficiency hard to achieve simultaneously: mixed-precision schemes preserve accuracy at the cost of efficiency, while uniform low-bit quantization degrades accuracy. Method: MicroScopiQ introduces a pruning-assisted, outlier-aware quantization paradigm that makes no assumptions about outlier locality and is therefore applicable across diverse FMs. It uses pruning to remove a fraction of the least important weights, reallocating their bit budget so that outliers are retained at higher precision while regular weights remain aggressively compressed. Integrated with multi-precision INT processing elements and a custom network-on-chip (ReCoN), it jointly optimizes accuracy, memory footprint, and hardware efficiency. Results: Experiments demonstrate state-of-the-art accuracy across multiple quantization configurations, up to 3x faster inference, and 2x lower energy consumption than existing alternatives.
📝 Abstract
Quantization of foundation models (FMs) is significantly more challenging than that of traditional DNNs due to the emergence of large-magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a fraction of the least important weights to accommodate the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of multi-precision INT processing elements and a network-on-chip called ReCoN that efficiently abstracts away the complexity of supporting high-precision outliers. Additionally, unlike prior techniques, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across diverse quantization settings demonstrate that MicroScopiQ achieves state-of-the-art quantization accuracy, while delivering up to 3x faster inference and 2x lower energy consumption compared to existing alternatives. Code is available at: https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization
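The core idea, pruning the least important weights so their bit budget can pay for keeping outliers at higher precision, can be illustrated with a minimal NumPy sketch. This is a simplified toy (magnitude-based importance, per-tensor uniform quantization, and the function name `pruning_assisted_quantize` are our assumptions for illustration), not the paper's actual algorithm or bit-packing scheme:

```python
import numpy as np

def pruning_assisted_quantize(w, bits=4, outlier_bits=8, outlier_frac=0.01):
    """Toy sketch of pruning-assisted outlier-aware quantization:
    keep the top-|w| fraction (outliers) at higher precision, prune an
    equal fraction of the smallest-|w| weights to free bit budget, and
    quantize the remaining inliers uniformly at low precision."""
    flat = w.astype(np.float64).ravel().copy()
    n = flat.size
    k = max(1, int(outlier_frac * n))
    order = np.argsort(np.abs(flat))
    pruned_idx = order[:k]        # least important weights -> pruned
    outlier_idx = order[-k:]      # largest magnitudes -> outliers

    def uniform_q(x, b):
        # symmetric uniform quantizer with a per-group max-abs scale
        scale = np.max(np.abs(x)) / (2 ** (b - 1) - 1)
        return np.round(x / scale) * scale if scale > 0 else x

    q = flat.copy()
    q[pruned_idx] = 0.0           # pruning pays for the extra outlier bits
    inlier_mask = np.ones(n, dtype=bool)
    inlier_mask[pruned_idx] = False
    inlier_mask[outlier_idx] = False
    q[inlier_mask] = uniform_q(q[inlier_mask], bits)
    q[outlier_idx] = uniform_q(q[outlier_idx], outlier_bits)
    return q.reshape(w.shape)
```

In this sketch the per-element storage stays aligned: each pruned weight costs zero bits, offsetting the extra bits spent on an outlier, which is the memory-alignment property the co-design relies on.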