MSQ: Memory-Efficient Bit Sparsification Quantization

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address efficiency bottlenecks in deploying deep neural networks (DNNs) on mobile and edge devices, this paper proposes an end-to-end trainable mixed-precision quantization method. The approach introduces a differentiable round-clamp quantizer and leverages Hessian information to enforce bit-level sparsity regularization. Crucially, it avoids explicit bit-width parameter separation and instead performs differentiable pruning directly on the low-order bits of the weights, unifying precision allocation and sparse-structure optimization. This design circumvents the high memory overhead and training complexity inherent in conventional bit-level sparsity methods. Experiments demonstrate that the method achieves state-of-the-art accuracy and compression ratios while reducing trainable parameters by up to 8.00× and training time by up to 86%, significantly improving adaptability and deployment efficiency on edge devices.
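The core idea above, a round-clamp quantizer whose least significant bits (LSBs) can be read off and regularized, can be sketched as follows. This is a minimal plain-Python illustration under assumed details, not the paper's exact formulation; the function names `round_clamp` and `lsb` are hypothetical, and the straight-through gradient machinery used during training is omitted here.

```python
import math

def round_clamp(w, scale, bits):
    """Round-clamp quantizer sketch: round a weight to the nearest
    multiple of `scale`, then clamp to the signed `bits`-bit integer
    range. Returns the integer code."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return min(max(round(w / scale), qmin), qmax)

def lsb(code):
    """Least significant bit of an integer weight code (0 or 1).
    If every LSB in a layer is zero, the layer's weights fit in
    one fewer bit, i.e. its precision can be reduced."""
    return code - 2 * math.floor(code / 2)

codes = [round_clamp(w, 0.1, 4) for w in (0.52, 0.40, -0.28)]
lsbs = [lsb(c) for c in codes]
```

Regularizing the `lsbs` values toward zero is what allows precision to be lowered without maintaining separate per-bit parameters.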

📝 Abstract
As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies utilizing bit-level sparsity have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer to enable differentiable computation of the least significant bits (LSBs) from model weights. It further employs regularization to induce sparsity in these LSBs, enabling effective precision reduction without explicit bit-level parameter splitting. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ achieves up to 8.00x reduction in trainable parameters and up to 86% reduction in training time compared to previous bit-level quantization, while maintaining competitive accuracy and compression rates. This makes it a practical solution for training efficient DNNs on resource-constrained devices.
Problem

Research questions and friction points this paper is trying to address.

Optimizing mixed-precision quantization for DNN efficiency
Reducing training complexity and GPU memory usage
Enhancing precision reduction without bit-level splitting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Round-clamp quantizer for differentiable LSB computation
Regularization induces sparsity in least significant bits
Hessian-based pruning of multiple LSBs simultaneously
Seokho Han
Department of Electrical and Computer Engineering, Sungkyunkwan University, Korea
Seoyeon Yoon
Department of Electrical and Computer Engineering, Sungkyunkwan University, Korea
Jinhee Kim
Department of Electrical and Computer Engineering, Sungkyunkwan University, Korea
Dongwei Wang
Department of Electrical and Computer Engineering, University of Arizona, USA
Kang Eun Jeon
Department of Electrical and Computer Engineering, Sungkyunkwan University, Korea
Huanrui Yang
Assistant Professor, ECE, University of Arizona
Efficient deep learning · Trustworthy deep learning
Jong Hwan Ko
SungKyunKwan Univ. (SKKU)
Deep learning accelerator · Image/audio processing · VLSI/IoT systems design