JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying low-bit neural networks on edge devices faces two major challenges: memory explosion during quantization-aware training (QAT) and prohibitively long hardware-aware accelerator mapping searches. This paper proposes a software-hardware co-design framework built on three-dimensional joint optimization. First, it introduces Channel-wise Sparse Quantization (CSQ), a QAT technique that significantly reduces memory overhead by sparsifying quantization at the channel level, applying it only to the most sensitive channels. Second, it designs BatchTile, a compiler-aware hardware generation network that compresses the accelerator mapping search into a single inference pass of 0.15 seconds per iteration, enabling sub-second hardware adaptation. Third, it unifies neural architecture search (NAS), low-bit quantization, and hardware customization into a single differentiable optimization objective. Evaluated on ImageNet, the method achieves roughly 7% higher Top-1 accuracy than state-of-the-art methods, marking the first end-to-end deployment solution to simultaneously deliver high accuracy at sub-4-bit precision, a low QAT memory footprint, and sub-second hardware specialization for edge devices.
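
To make the CSQ idea concrete, below is a minimal PyTorch sketch of a channel-wise sparse fake-quantizer: only channels flagged by a sensitivity mask are fake-quantized during QAT, so quantization bookkeeping is kept for just a subset of channels. The class and parameter names (ChannelwiseSparseQuant, sparsity, mask) are illustrative assumptions, not the authors' code, and the placeholder mask stands in for the paper's actual sensitivity-based channel selection.

```python
import torch
import torch.nn as nn

class ChannelwiseSparseQuant(nn.Module):
    # Sketch of a CSQ-style fake-quantizer for QAT (assumed names, not the
    # paper's implementation). Channels flagged by `mask` are fake-quantized;
    # the rest pass through in full precision.
    def __init__(self, num_channels, num_bits=4, sparsity=0.5):
        super().__init__()
        self.num_bits = num_bits
        # Learned per-channel quantization scale.
        self.scale = nn.Parameter(torch.ones(num_channels))
        # Placeholder mask: the paper targets the most quantization-sensitive
        # channels; here we simply take the first k channels for illustration.
        k = int(num_channels * sparsity)
        mask = torch.zeros(num_channels, dtype=torch.bool)
        mask[:k] = True
        self.register_buffer("mask", mask)

    def forward(self, w):
        # w: weight tensor of shape [out_channels, ...]
        qmax = 2 ** (self.num_bits - 1) - 1
        shape = (-1,) + (1,) * (w.dim() - 1)
        s = self.scale.view(shape).abs().clamp_min(1e-8)
        # Fake-quantize with a straight-through estimator:
        # round in the forward pass, identity gradient in the backward pass.
        w_q = (w / s).round().clamp(-qmax - 1, qmax) * s
        w_q = w + (w_q - w).detach()
        # Quantize only the masked (sensitive) channels.
        return torch.where(self.mask.view(shape), w_q, w)

# Usage: fake-quantize a conv layer's weights, half the channels at 4 bits.
conv = nn.Conv2d(16, 32, 3)
csq = ChannelwiseSparseQuant(num_channels=32, num_bits=4, sparsity=0.5)
w_q = csq(conv.weight)
```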

📝 Abstract
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: low-precision quantization-aware training can lead to significant memory usage, since large intermediate features and latent weights must be stored for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
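
As a rough illustration of the BatchTile idea, the key move is to encode every candidate tiling of an operator as a feature vector and score the whole batch in a single forward pass of a small cost network, replacing a per-candidate compiler-simulation loop. Everything below (TileCostNet, the feature encoding, pick_tiling) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TileCostNet(nn.Module):
    # Hypothetical cost model: maps an encoding of operator shape + tile
    # sizes to a predicted cost (e.g., latency) for that tiling.
    def __init__(self, feat_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tile_feats):
        # tile_feats: [num_candidates, feat_dim] batch of tiling encodings
        return self.mlp(tile_feats).squeeze(-1)

def pick_tiling(cost_net, candidates):
    # Score ALL candidate tilings in one batched inference pass, then take
    # the argmin, instead of evaluating candidates one by one.
    with torch.no_grad():
        costs = cost_net(candidates)
    return int(costs.argmin().item())

# Example: 1,024 candidate tilings for one operator, scored in one pass.
net = TileCostNet()
cands = torch.rand(1024, 8)
best = pick_tiling(net, cands)
```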
Problem

Research questions and friction points this paper is trying to address.

Neural Network Optimization
Resource-constrained Devices
Low-precision Quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Network Optimization
Sparse Quantization
Accelerator Co-optimization
Mingzi Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Yuan Meng
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Chen Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Weixiang Zhang
Tsinghua University
Neural Representation, 3D Computer Vision
Yijian Qin
Tsinghua University
AutoML, Graph neural network, Neural architecture search
Yang Yao
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Yingxin Li
Tsinghua University
LLM, VLM, Efficient ML
Tongtong Feng
Tsinghua University
Environment Learning, Autonomous Embodied AI, Multimedia Intelligence
Xin Wang
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Xun Guan
CUHK; Laval University; Tsinghua Shenzhen International Graduate School
optical communication, photonics, radio over fiber, optical sensing
Zhi Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Wenwu Zhu
Professor, Computer Science, Tsinghua University
Multimedia Computing, Network Representation Learning