Low-bit Model Quantization for Deep Neural Networks: A Survey

📅 2025-05-08
🤖 AI Summary
Deploying deep neural networks (DNNs) faces challenges from high computational overhead and large model sizes. While low-bit weight quantization accelerates inference and reduces memory bandwidth requirements, it often incurs substantial accuracy degradation. This paper presents a systematic survey of low-bit weight quantization research from 2019 to 2024. We propose the first unified taxonomy comprising eight major categories and 24 subcategories, covering linear/nonlinear quantization, layer-wise/channel-wise calibration, retraining-free and fine-tuning-based paradigms, gradient approximation techniques, and mixed-precision search strategies. Through a structured comparative analysis of over 100 state-of-the-art works, we identify common bottlenecks, clarify promising future directions, and highlight open challenges. To foster reproducibility and industrial adoption, we open-source Awesome-Model-Quantization, a curated and continuously updated resource repository, thereby advancing the standardization and practical deployment of quantization techniques.

📝 Abstract
With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and large model sizes are often unacceptable for real-world deployment. Model quantization, an effective model light-weighting technique, has become an indispensable step in the deployment pipeline. The essence of quantization acceleration is the conversion from continuous floating-point numbers to discrete integers, which significantly speeds up memory I/O and computation, i.e., addition and multiplication. However, this conversion also degrades performance because of the loss of precision. Therefore, it has become increasingly popular and critical to investigate how to perform the conversion and how to compensate for the information loss. This article surveys the last five years of progress on low-bit quantization of DNNs. We discuss and compare state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. Furthermore, we shed light on potential research opportunities in the field of model quantization. A curated list of model quantization resources is provided at https://github.com/Kai-Liu001/Awesome-Model-Quantization.
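The float-to-integer conversion the abstract describes can be made concrete with a minimal sketch of symmetric uniform quantization; the helper names below are illustrative, not taken from the paper.

```python
def quantize(weights, num_bits=8):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]   # discrete integer codes
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; the rounding error is the precision loss."""
    return [qi * scale for qi in q]

weights = [0.52, -1.30, 0.07, 0.91]
q, scale = quantize(weights)
recovered = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(q)        # → [51, -127, 7, 89]
print(max_err)  # round-trip error is bounded by scale / 2
```

The integer codes can be stored and multiplied far more cheaply than 32-bit floats, while the per-element error stays within half a quantization step; the survey's methods are, in essence, ways of choosing the mapping and compensating for this residual error.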
Problem

Research questions and friction points this paper is trying to address.

Reducing computation costs and model sizes in DNN deployment
Minimizing performance degradation from low-bit quantization
Surveying and classifying state-of-the-art quantization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts floating-point numbers to discrete integers
Compensates for precision loss in quantization
Classifies quantization methods into 8 categories
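One of the calibration axes in the taxonomy, per-tensor versus per-channel (channel-wise) scaling, can be illustrated with a small sketch; the values and function names are hypothetical, chosen only to show why finer-grained calibration compensates for precision loss when channel magnitudes differ.

```python
def quant_error(channel, scale):
    """Mean absolute round-trip error for one channel at a given scale."""
    return sum(abs(w - round(w / scale) * scale) for w in channel) / len(channel)

QMAX = 127  # signed 8-bit range

# Two channels with very different magnitudes (a common failure case).
channels = [[0.010, -0.008, 0.005], [2.0, -1.5, 1.8]]

# Per-tensor: one scale from the global maximum swamps the small channel.
tensor_scale = max(abs(w) for ch in channels for w in ch) / QMAX
err_tensor = sum(quant_error(ch, tensor_scale) for ch in channels)

# Per-channel: each channel gets its own scale from its own maximum.
err_channel = sum(quant_error(ch, max(abs(w) for w in ch) / QMAX)
                  for ch in channels)

print(err_channel < err_tensor)  # True: finer calibration, lower error
```

The small-magnitude channel loses almost all of its resolution under the shared scale, which is why channel-wise calibration is a recurring technique across the surveyed methods.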
Kai Liu
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Qian Zheng
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Kaiwen Tao
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Zhiteng Li
Shanghai Jiao Tong University
Large Language Models · Model Compression · Computer Vision
Haotong Qin
ETH Zürich
TinyML · Model Compression · Computer Vision · Deep Learning
Wenbo Li
The Chinese University of Hong Kong
Computer Vision · Deep Learning
Yong Guo
Huawei Consumer Business Group, China
Xianglong Liu
Beihang University, China
Linghe Kong
Shanghai Jiao Tong University
Internet of Things · Mobile computing · Big data
Guihai Chen
Professor of Computer Science
Computer Science and Technology
Yulun Zhang
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Xiaokang Yang
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China