MiCo: End-to-End Mixed Precision Neural Network Co-Exploration Framework for Edge AI

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limited flexibility of existing searches for low-bit mixed-precision quantization (MPQ) configurations and the lack of hardware deployment support in edge AI, this paper proposes the first end-to-end MPQ co-optimization framework. The framework integrates hardware-aware latency modeling with a global optimization algorithm to automatically search for layer-wise bit-width assignments that maximize accuracy under strict latency constraints. It accounts for both quantization-aware training (QAT) and post-training quantization (PTQ), and enables direct compilation of PyTorch models into bare-metal C code for embedded deployment. Unlike prior work, the approach covers the full pipeline, from MPQ algorithm search to embedded-system implementation, thereby closing a critical gap in edge AI quantization. Evaluated across diverse edge devices, it achieves an average 2.1× inference speedup and 48% memory footprint reduction while preserving over 99% of the original model accuracy.

📝 Abstract

Quantized Neural Networks (QNNs) with extremely low-bitwidth data have proven promising for efficient storage and computation on edge devices. To further reduce the accuracy drop while increasing speedup, layer-wise mixed-precision quantization (MPQ) has become a popular solution. However, existing algorithms for exploring MPQ schemes are limited in flexibility and efficiency. Comprehending the complex impacts of different MPQ schemes on post-training quantization and quantization-aware training results is a challenge for conventional methods. Furthermore, an end-to-end framework for the optimization and deployment of MPQ models is missing in existing work. In this paper, we propose the MiCo framework, a holistic MPQ exploration and deployment framework for edge AI applications. The framework adopts a novel optimization algorithm to search for optimal quantization schemes with the highest accuracy while meeting latency constraints. Hardware-aware latency models are built for different hardware targets to enable fast exploration. After the exploration, the framework enables direct deployment from PyTorch MPQ models to bare-metal C code, leading to end-to-end speedup with minimal accuracy drop.
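
The search problem stated in the abstract (pick one bit-width per layer, maximize accuracy, stay within a latency budget) can be illustrated with a toy example. The sketch below is not MiCo's optimization algorithm, which the abstract does not specify. It is a brute-force stand-in over three hypothetical layers, and every name and number in it (`LATENCY_MS`, `ACC_DROP`, `search`, the per-layer values) is an assumption made up for illustration. It also assumes latency and accuracy loss are additive across layers, which real models need not satisfy.

```python
from itertools import product

# Hypothetical per-layer latency (ms) for each candidate bit-width.
# A real hardware-aware model would be fitted from measurements on the target.
LATENCY_MS = {
    "conv1": {8: 4.0, 4: 2.5, 2: 1.5},
    "conv2": {8: 6.0, 4: 3.5, 2: 2.0},
    "fc":    {8: 2.0, 4: 1.2, 2: 0.8},
}

# Hypothetical accuracy penalty (percentage points) per layer and bit-width.
ACC_DROP = {
    "conv1": {8: 0.0, 4: 0.6, 2: 3.0},
    "conv2": {8: 0.0, 4: 0.3, 2: 1.5},
    "fc":    {8: 0.0, 4: 0.2, 2: 1.0},
}


def predict_latency(scheme):
    """Sum the per-layer latencies for a {layer: bits} scheme."""
    return sum(LATENCY_MS[layer][bits] for layer, bits in scheme.items())


def predict_acc_drop(scheme):
    """Sum the per-layer accuracy penalties for a {layer: bits} scheme."""
    return sum(ACC_DROP[layer][bits] for layer, bits in scheme.items())


def search(latency_budget_ms, candidates=(8, 4, 2)):
    """Return the feasible scheme with the smallest predicted accuracy drop,
    or None if no scheme meets the latency budget."""
    layers = list(LATENCY_MS)
    best = None
    for bits in product(candidates, repeat=len(layers)):
        scheme = dict(zip(layers, bits))
        if predict_latency(scheme) > latency_budget_ms:
            continue  # violates the latency constraint
        if best is None or predict_acc_drop(scheme) < predict_acc_drop(best):
            best = scheme
    return best


if __name__ == "__main__":
    print(search(8.0))
```

Exhaustive enumeration grows exponentially with the number of layers, which is why a dedicated optimization algorithm and fast latency models matter for realistic networks.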

Problem

Research questions and friction points this paper is trying to address.

Explores optimal mixed-precision quantization for edge AI efficiency

Addresses accuracy-latency trade-offs in neural network quantization

Fills the gap left by existing work, which lacks an end-to-end framework for MPQ optimization and deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel optimization algorithm for MPQ schemes

Hardware-aware latency models for fast exploration

Direct deployment from PyTorch to bare-metal C
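
All three contributions above build on assigning a different integer bit-width to each layer. As a minimal sketch of what that means for the weights, the example below applies symmetric uniform quantization at a chosen bit-width. This is a common textbook scheme used here as an assumption; the source does not say which quantizer MiCo uses, and the function names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax]; return (ints, scale).

    Assumes bits >= 2 so that qmax = 2**(bits - 1) - 1 is at least 1.
    """
    if bits < 2:
        raise ValueError("bits must be >= 2 for this symmetric scheme")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from integer levels."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute reconstruction error at the given bit-width."""
    q, scale = quantize_symmetric(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))


if __name__ == "__main__":
    layer_weights = [0.5, -1.0, 0.25, 0.0]
    for bits in (8, 4, 2):
        print(bits, quantize_symmetric(layer_weights, bits), max_error(layer_weights, bits))
```

Lower bit-widths shrink storage and can speed up computation but increase reconstruction error, which is the trade-off a per-layer MPQ search has to balance.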
