FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between severe accuracy degradation and high memory overhead in ultra-low-bit quantization of large language models (LLMs), this paper proposes FineQ, a fine-grained mixed-precision quantization method co-designed with hardware. FineQ partitions weights into fine-grained clusters and, guided by the distribution of outliers within each cluster, protects intra-cluster outliers at 3-bit precision; it introduces an index-data concatenation encoding scheme to enable aligned memory access; and it designs a dedicated accelerator based on temporal coding that simplifies the multipliers in the systolic array. The core innovations are the first intra-cluster fine-grained mixed-precision quantization mechanism and an outlier-aware encoding scheme. Experiments show that, at comparable average bit-widths, FineQ achieves higher model accuracy than state-of-the-art mixed-precision approaches, while the accelerator delivers up to 1.79× better energy efficiency and reduces systolic array area by 61.2%.
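The intra-cluster outlier protection described above can be pictured with a short sketch. This is an illustrative reconstruction, not FineQ's actual algorithm: the cluster size of 4, the single-outlier assumption, and the symmetric round-to-nearest quantizer are simplifications chosen for clarity.

```python
import numpy as np

def quant_dequant(w, bits, scale):
    """Symmetric round-to-nearest quantization to signed 2^(bits-1)-1 levels."""
    levels = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale

def mixed_precision_quant(weights, cluster_size=4, n_outliers=1):
    """Partition weights into small clusters; within each cluster, keep the
    largest-magnitude weights (outliers) at 3 bits and the rest at 2 bits."""
    w = np.asarray(weights, dtype=np.float64)
    out = np.empty_like(w)
    for start in range(0, len(w), cluster_size):
        c = w[start:start + cluster_size]
        view = out[start:start + cluster_size]
        is_outlier = np.zeros(len(c), dtype=bool)
        is_outlier[np.argsort(-np.abs(c))[:n_outliers]] = True
        # Outliers: 3-bit codes scaled to the outlier magnitude range.
        s_hi = max(np.abs(c[is_outlier]).max(), 1e-12) / (2 ** 2 - 1)
        view[is_outlier] = quant_dequant(c[is_outlier], 3, s_hi)
        # Remaining weights: 2-bit codes scaled to the (smaller) inlier range,
        # so protecting the outlier no longer inflates the inlier step size.
        s_lo = max(np.abs(c[~is_outlier]).max(), 1e-12) / (2 ** 1 - 1)
        view[~is_outlier] = quant_dequant(c[~is_outlier], 2, s_lo)
    return out
```

On a cluster containing one large outlier, this mixed scheme reconstructs the weights with far lower error than uniform 2-bit quantization, at a nearly identical average bit-width.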

📝 Abstract
Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory consumption of LLMs. However, advanced single-precision quantization methods experience significant accuracy degradation when quantizing to ultra-low bits. Existing mixed-precision quantization methods quantize weights in coarse-grained groups: employing high precision for group data leads to substantial memory overhead, whereas low precision severely impacts model accuracy. To address this issue, we propose FineQ, software-hardware co-design for low-bit fine-grained mixed-precision quantization of LLMs. First, FineQ partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters, thus achieving a balance between model accuracy and memory overhead. Then, we propose an outlier protection mechanism within clusters that uses 3 bits to represent outliers and introduce an encoding scheme for index and data concatenation to enable aligned memory access. Finally, we introduce an accelerator utilizing temporal coding that effectively supports the quantization algorithm while simplifying the multipliers in the systolic array. FineQ achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width. Meanwhile, the accelerator achieves up to 1.79× higher energy efficiency and reduces the area of the systolic array by 61.2%.
Problem

Research questions and friction points this paper is trying to address.

Reduces memory demands of LLMs via fine-grained quantization
Balances accuracy and memory by clustering weight outliers
Enhances efficiency with hardware-optimized low-bit encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained clusters for mixed-precision quantization
Outlier protection mechanism with 3-bit encoding
Temporal coding accelerator simplifying multipliers
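The "index and data concatenation" idea above can be pictured as packing, for each cluster, the outlier's position index next to the quantized codes so that every cluster occupies the same fixed bit width. The layout below (2 index bits + one 3-bit outlier code + three 2-bit inlier codes = 11 bits per 4-weight cluster) is a hypothetical sketch; the paper's exact bit layout is not reproduced here.

```python
CLUSTER_SIZE = 4  # hypothetical: 4 weights per cluster, 1 outlier

def encode_cluster(outlier_pos, outlier_code, inlier_codes):
    """Pack [2-bit index | 3-bit outlier code | 3 x 2-bit inlier codes]
    into one fixed-width 11-bit word, keeping clusters memory-aligned."""
    word = outlier_pos                  # 2 bits: where the outlier sits
    word = (word << 3) | outlier_code   # 3 bits: outlier magnitude code
    for c in inlier_codes:              # 2 bits each: inlier codes
        word = (word << 2) | c
    return word

def decode_cluster(word):
    """Invert encode_cluster."""
    inlier_codes = []
    for _ in range(CLUSTER_SIZE - 1):
        inlier_codes.append(word & 0b11)
        word >>= 2
    inlier_codes.reverse()
    outlier_code = word & 0b111
    outlier_pos = (word >> 3) & 0b11
    return outlier_pos, outlier_code, inlier_codes
```

Because every cluster encodes to exactly the same width, a hardware decoder can locate cluster boundaries at fixed offsets rather than parsing a variable-length stream, which is what makes aligned memory access possible.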
Xilong Xie
School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Liang Wang
School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Limin Xiao
FDU (Fiber Optics, Optoelectronics)
Meng Han
Intelligence Fusion Research Center (IFRC) (Reliable AI, Data Mining, Machine Learning, Big Data, Security & Privacy)
Lin Sun
Qihoo 360 (large language model)
Shuai Zheng
School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Xiangrong Xu
School of Computer Science and Engineering, Beihang University, Beijing 100191, China