JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying low-bit neural networks on edge devices faces two major challenges: memory explosion during quantization-aware training (QAT) and prohibitively long hardware-aware accelerator mapping searches. This paper proposes a software-hardware co-design framework built on three-dimensional joint optimization. First, it introduces Channel-wise Sparse Quantization (CSQ), a QAT technique that significantly reduces memory overhead by sparsifying quantization at the channel level, applying it only to the most sensitive channels. Second, it designs BatchTile, a compiler-aware hardware generation network that compresses the accelerator mapping search into a single inference pass of 0.15 seconds per iteration, enabling sub-second hardware adaptation. Third, it unifies neural architecture search (NAS), low-bit quantization, and hardware customization into a single differentiable optimization objective. Evaluated on ImageNet, the method achieves roughly 7% higher Top-1 accuracy than state-of-the-art methods, marking the first end-to-end deployment solution to simultaneously deliver high accuracy at sub-4-bit precision, a low QAT memory footprint, and sub-second hardware specialization for edge devices.
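
To make the CSQ idea concrete, below is a minimal PyTorch sketch of a channel-wise sparse fake-quantizer: only channels flagged by a sensitivity mask are fake-quantized during QAT, so quantization bookkeeping is kept for just a subset of channels. The class and parameter names (ChannelwiseSparseQuant, sparsity, mask) are illustrative assumptions, not the authors' code, and the placeholder mask stands in for the paper's actual sensitivity-based channel selection.

```python
import torch
import torch.nn as nn

class ChannelwiseSparseQuant(nn.Module):
    # Sketch of a CSQ-style fake-quantizer for QAT (assumed names, not the
    # paper's implementation). Channels flagged by `mask` are fake-quantized;
    # the rest pass through in full precision.
    def __init__(self, num_channels, num_bits=4, sparsity=0.5):
        super().__init__()
        self.num_bits = num_bits
        # Learned per-channel quantization scale.
        self.scale = nn.Parameter(torch.ones(num_channels))
        # Placeholder mask: the paper targets the most quantization-sensitive
        # channels; here we simply take the first k channels for illustration.
        k = int(num_channels * sparsity)
        mask = torch.zeros(num_channels, dtype=torch.bool)
        mask[:k] = True
        self.register_buffer("mask", mask)

    def forward(self, w):
        # w: weight tensor of shape [out_channels, ...]
        qmax = 2 ** (self.num_bits - 1) - 1
        shape = (-1,) + (1,) * (w.dim() - 1)
        s = self.scale.view(shape).abs().clamp_min(1e-8)
        # Fake-quantize with a straight-through estimator:
        # round in the forward pass, identity gradient in the backward pass.
        w_q = (w / s).round().clamp(-qmax - 1, qmax) * s
        w_q = w + (w_q - w).detach()
        # Quantize only the masked (sensitive) channels.
        return torch.where(self.mask.view(shape), w_q, w)

# Usage: fake-quantize a conv layer's weights, half the channels at 4 bits.
conv = nn.Conv2d(16, 32, 3)
csq = ChannelwiseSparseQuant(num_channels=32, num_bits=4, sparsity=0.5)
w_q = csq(conv.weight)
```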

📝 Abstract
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: low-precision quantization-aware training can lead to significant memory usage, since large intermediate features and latent weights must be stored for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
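
As a rough illustration of the BatchTile idea, the key move is to encode every candidate tiling of an operator as a feature vector and score the whole batch in a single forward pass of a small cost network, replacing a per-candidate compiler-simulation loop. Everything below (TileCostNet, the feature encoding, pick_tiling) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TileCostNet(nn.Module):
    # Hypothetical cost model: maps an encoding of operator shape + tile
    # sizes to a predicted cost (e.g., latency) for that tiling.
    def __init__(self, feat_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tile_feats):
        # tile_feats: [num_candidates, feat_dim] batch of tiling encodings
        return self.mlp(tile_feats).squeeze(-1)

def pick_tiling(cost_net, candidates):
    # Score ALL candidate tilings in one batched inference pass, then take
    # the argmin, instead of evaluating candidates one by one.
    with torch.no_grad():
        costs = cost_net(candidates)
    return int(costs.argmin().item())

# Example: 1,024 candidate tilings for one operator, scored in one pass.
net = TileCostNet()
cands = torch.rand(1024, 8)
best = pick_tiling(net, cands)
```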
Problem

Research questions and friction points this paper is trying to address.

Neural Network Optimization
Resource-constrained Devices
Low-precision Quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Network Optimization
Sparse Quantization
Accelerator Co-optimization
Mingzi Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Yuan Meng
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Chen Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Weixiang Zhang
Tsinghua University
Neural Representation, 3D Computer Vision
Yijian Qin
Tsinghua University
AutoML, Graph neural network, Neural architecture search
Yang Yao
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Yingxin Li
Tsinghua University
LLM, VLM, Efficient ML
Tongtong Feng
Tsinghua University
Environment Learning, Autonomous Embodied AI, Multimedia Intelligence
Xin Wang
Department of Computer Science and Technology & BNRist, Tsinghua University, Beijing, China
Xun Guan
CUHK; Laval University; Tsinghua Shenzhen International Graduate School
optical communication, photonics, radio over fiber, optical sensing
Zhi Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Wenwu Zhu
Professor, Computer Science, Tsinghua University
Multimedia Computing, Network Representation Learning