AI Summary
In mobile edge computing, DNN inference suffers from severe transmission bottlenecks and resource constraints under conventional layer-wise partitioning and sequential execution. To address this, we propose an operator-level collaborative inference system. Our method breaks inter-layer dependencies by decomposing models into local operators and enabling fine-grained parallel scheduling, tightly overlapping subtask computation with cross-device communication. Crucially, it co-designs the inference strategy with the model's intrinsic structural characteristics to optimize end-edge collaborative execution. Experiments show that, compared to state-of-the-art approaches, our system reduces single-inference latency by up to 50% and energy consumption by up to 75%, while strictly preserving the original model's accuracy.
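The overlap of subtask computation with cross-device communication can be illustrated with a minimal sketch (not from the paper): while one sub-operation's result is being "transmitted" on a background thread, the next sub-operation is computed on the main thread. The `compute`/`transmit` functions and their delays are placeholder stand-ins for real operator kernels and network transfers.

```python
import concurrent.futures
import time

def compute(chunk):
    # Stand-in for executing one sub-operation of a local operator
    time.sleep(0.01)
    return chunk * 2

def transmit(result):
    # Stand-in for sending a partial result to the edge server
    time.sleep(0.01)
    return result

chunks = [1, 2, 3, 4]
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    pending = None
    for c in chunks:
        r = compute(c)                      # compute sub-operation i
        if pending is not None:
            results.append(pending.result())  # collect transfer of i-1
        pending = pool.submit(transmit, r)  # send i while computing i+1
    results.append(pending.result())

assert results == [2, 4, 6, 8]
```

Because each sub-operation is independent, its transmission can start as soon as it finishes, hiding transfer latency behind the remaining computation instead of waiting for an entire layer's output.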
Abstract
Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while coping with limited computational resources and battery life. While Mobile Edge Computing (MEC) makes collaborative inference with GPU servers a promising solution, existing approaches rely primarily on layer-wise model partitioning and suffer from significant transmission bottlenecks caused by the sequential execution of DNN operations. To address this challenge, we present Intra-DP, a high-performance collaborative inference system optimized for DNN inference on MEC. Intra-DP employs a novel parallel computing technique based on local operators (i.e., operators whose minimum unit of input is smaller than the entire input tensor, such as convolution, where each kernel application needs only a local receptive field). By decomposing their computations into several independent sub-operations and overlapping the computation and transmission of different sub-operations through parallel execution, Intra-DP mitigates transmission bottlenecks in MEC, enabling fast and energy-efficient inference. Our evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines, without sacrificing accuracy.
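Why local operators admit this decomposition can be seen with a toy 1-D "valid" convolution (an illustrative sketch, not the paper's implementation): because each output element depends only on a small window of the input, the input can be split into overlapping chunks (each extended by a halo of `k - 1` samples) and each chunk convolved independently, yielding exactly the same result as convolving the whole tensor.

```python
def conv1d(x, w):
    # Naive 1-D "valid" convolution: output[i] = sum_j x[i+j] * w[j]
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

x = [1, 2, 3, 4, 5, 6, 7, 8]   # toy input tensor
w = [1, 0, -1]                 # toy convolution kernel, k = 3
k = len(w)

full = conv1d(x, w)            # reference: convolve the whole input

# Split at the midpoint; the first chunk keeps a halo of k-1 extra
# samples so no output element straddles the cut.
mid = len(x) // 2
part1 = conv1d(x[:mid + k - 1], w)  # independent sub-operation 1
part2 = conv1d(x[mid:], w)          # independent sub-operation 2

assert part1 + part2 == full   # decomposition is exact
```

The two sub-operations share no intermediate state, so on a real system they could run on different devices, with each partial result transmitted as soon as it is ready.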