🤖 AI Summary
To address efficiency bottlenecks in deploying deep neural networks (DNNs) on mobile and edge devices, this paper proposes an end-to-end trainable mixed-precision quantization method. The approach introduces a differentiable round-clamp quantizer paired with bit-level sparsity regularization, and uses Hessian information to prune multiple low-order bits at once. Crucially, it avoids explicit bit-level parameter splitting and instead performs differentiable pruning directly on the least significant bits (LSBs) of the weights, unifying precision allocation and sparse-structure optimization. This design sidesteps the high memory overhead and training complexity inherent in prior bit-level sparsity methods. Experiments demonstrate competitive accuracy and compression ratios while reducing trainable parameters by up to 8.00× and training time by up to 86%, improving adaptability and deployment efficiency on edge devices.
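As a concrete illustration (not taken from the paper), zeroing the low-order bits of an integer weight code is what lowers its effective bit-width; the snippet below uses a made-up 8-bit code purely for intuition:

```python
# Illustration only: an 8-bit signed weight code with value 109 = 0b01101101.
# Clearing its two least significant bits gives 108 = 0b01101100, a multiple
# of 4, so it can be stored as 27 (= 108 >> 2) in 6 bits, with the factor 4
# folded into the layer's quantization scale.
code = 109
pruned = (code >> 2) << 2   # 108, two LSBs cleared
stored = pruned >> 2        # 27, fits in 6 bits
print(bin(code), bin(pruned), stored)
```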
📝 Abstract
As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies utilizing bit-level sparsity have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer to enable differentiable computation of the least significant bits (LSBs) from model weights. It further employs regularization to induce sparsity in these LSBs, enabling effective precision reduction without explicit bit-level parameter splitting. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ achieves up to 8.00x reduction in trainable parameters and up to 86% reduction in training time compared to previous bit-level quantization methods, while maintaining competitive accuracy and compression rates. This makes it a practical solution for training efficient DNNs on resource-constrained devices.
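For a feel of the mechanism, the sketch below is a minimal, hypothetical PyTorch rendering of the idea: quantize weights with a straight-through rounding quantizer, extract the k least significant bits of the integer codes arithmetically, and add a penalty that pushes those bits toward zero. The names `ste_round`, `quantize`, and `lsb_penalty` are illustrative inventions; this is a sketch under those assumptions, not MSQ's actual round-clamp quantizer, regularizer, or Hessian-based pruning rule.

```python
import torch

def ste_round(x):
    # Round in the forward pass, pass gradients straight through in backward.
    return (x.round() - x).detach() + x

def quantize(w, scale, num_bits=8):
    # Symmetric uniform quantizer: weights -> signed integer codes.
    qmax = 2 ** (num_bits - 1) - 1
    return torch.clamp(ste_round(w / scale), -qmax - 1, qmax)

def lsb_penalty(q, k=1):
    # The k least significant bits of |q| equal |q| mod 2**k; penalizing them
    # encourages codes that are multiples of 2**k, i.e. k fewer effective bits.
    base = 2 ** k
    lsb = q.abs() - base * torch.floor(q.abs() / base)
    return lsb.mean()

# Hypothetical usage inside one training step.
w = torch.randn(256, 256, requires_grad=True)
scale = w.detach().abs().max() / 127           # per-tensor scale for 8-bit codes
q = quantize(w, scale, num_bits=8)
w_hat = q * scale                              # dequantized weights for the forward pass
task_loss = (w_hat ** 2).mean()                # stand-in for the real task loss
loss = task_loss + 1e-3 * lsb_penalty(q, k=2)  # sparsity pressure on the two lowest bits
loss.backward()
```

In this toy formulation, the straight-through estimator keeps the rounding step differentiable, while the LSB penalty acts directly on the quantized codes, which is the flavor of "no explicit bit-level parameter splitting" the abstract describes.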