QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference

📅 2025-06-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Edge devices exhibit significant heterogeneity in computational capability, memory capacity, and hardware architecture, while inference must simultaneously satisfy stringent constraints on accuracy, latency, and energy consumption. Method: This paper proposes a dynamic co-optimization framework for inference services that jointly optimizes layer-wise quantization bit-widths and the model partitioning point, guided by an analytical accuracy-degradation model, to enable accuracy-aware adaptive collaborative inference. The framework dynamically selects quantization policies and offloading locations based on real-time device compute capability, channel conditions, and task-specific accuracy requirements, enabling per-request customization of the computational load distribution. Results: Under an accuracy-loss constraint of ≤1%, the method reduces computational load by over 80%, significantly decreases end-to-end latency and energy consumption, and substantially improves system efficiency and robustness.
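The per-request decision described above, jointly picking a partition point and per-layer bit-widths, can be sketched as a small search. This is an illustrative toy under assumed models: the layer costs (`LAYER_FLOPS`, `LAYER_OUT_BITS`), the additive degradation formula, and the speedup-from-quantization assumption are all hypothetical stand-ins, not the paper's actual formulations.

```python
# Illustrative sketch: jointly pick a partition point (layers before it run on
# the device, the rest on the server) and per-layer quantization bit-widths
# that minimize end-to-end latency under an accuracy-degradation budget.
# All costs and the degradation formula below are hypothetical stand-ins.
from itertools import product

LAYER_FLOPS = [4e6, 8e6, 8e6, 2e6]     # compute cost per layer (FLOPs)
LAYER_OUT_BITS = [2e5, 1e5, 5e4, 1e4]  # activation size after each layer (bits)
INPUT_BITS = 3e5                       # raw input size (bits)
BITS = (4, 8, 16)                      # candidate bit-widths per device layer

def plan(device_flops, server_flops, channel_bps, acc_budget=0.01):
    """Brute-force the (partition point, bit-widths) pair with lowest latency."""
    best = None
    for cut in range(len(LAYER_FLOPS) + 1):   # layers [0, cut) run on device
        for bw in product(BITS, repeat=cut):  # quantize only the device part
            # toy degradation model: quantization noise shrinks as 2^(-2b)
            degradation = sum(40.0 * 2.0 ** (-2 * b) for b in bw)
            if degradation > acc_budget:
                continue
            # assume device throughput scales with 32 / bit-width
            t_dev = sum(f / (device_flops * 32 / b)
                        for f, b in zip(LAYER_FLOPS, bw))
            t_tx = (LAYER_OUT_BITS[cut - 1] if cut else INPUT_BITS) / channel_bps
            t_srv = sum(LAYER_FLOPS[cut:]) / server_flops
            total = t_dev + t_tx + t_srv
            if best is None or total < best[2]:
                best = (cut, bw, total)
    return best
```

Calling `plan(1e8, 1e10, 1e6)` trades quantized on-device compute against transmitting a smaller intermediate activation instead of the raw input, mirroring the framework's per-request adaptation to device speed and channel rate.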

๐Ÿ“ Abstract
As machine learning inference increasingly moves to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device's computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust across diverse scenarios. To this end, we propose an accuracy-aware and workload-balanced inference system that integrates joint model quantization and inference partitioning. In this approach, the server dynamically responds to inference queries by sending a quantized model and adaptively sharing the inference workload with the device; the device's computational power, channel capacity, and accuracy requirements are all taken into account when making these decisions. Furthermore, we introduce a new optimization framework for the inference system that incorporates joint model quantization and partitioning. Our approach optimizes layer-wise quantization bit-widths and partition points to minimize time consumption and cost while accounting for the varying accuracy requirements of tasks through an accuracy-degradation metric in our optimization model. To our knowledge, this work is the first to explore optimizing layer-wise quantization bit-widths in an inference serving system by introducing a theoretical measure of accuracy degradation. Simulation results demonstrate a substantial reduction in overall time and power consumption, with computation payloads decreasing by over 80% while accuracy degradation stays below 1%.
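The abstract's accuracy-degradation metric is not spelled out here. A common proxy for such a metric (purely an assumption for illustration, not the paper's formula) treats uniform quantization of each layer as additive noise with the textbook variance of step² / 12, weighted by a per-layer sensitivity coefficient:

```python
def degradation_estimate(bitwidths, ranges, sensitivities):
    """Illustrative additive proxy for accuracy degradation (lower is better).

    Assumes each layer's uniform quantizer contributes noise of variance
    step**2 / 12, scaled by a layer-sensitivity coefficient. This is a
    textbook quantization-noise model, not the paper's exact metric.
    """
    total = 0.0
    for b, r, s in zip(bitwidths, ranges, sensitivities):
        step = r / (2 ** b - 1)        # uniform quantization step size
        total += s * step ** 2 / 12.0  # noise variance times sensitivity
    return total
```

Under a model like this, dropping a layer from 8 to 4 bits raises the estimate sharply, which is what lets a planner trade precision for speed only where sensitivity permits.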
Problem

Research questions and friction points this paper is trying to address.

Adapts ML inference to edge device constraints and accuracy needs
Optimizes model quantization and workload balancing for efficiency
Minimizes time and power while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic model quantization for edge devices
Joint workload balancing and partitioning
Layer-wise bit-width optimization framework