🤖 AI Summary
To address high latency and energy consumption in DNN inference on resource-constrained devices, this paper proposes a holistic energy-efficiency co-optimization framework jointly tuning memory frequency, computation frequency, wireless transmission power, and task offloading decisions. Departing from conventional CPU/GPU-only DVFS approaches, we are the first to systematically model and empirically validate the critical impact of coordinated memory–computation frequency scaling on DNN inference energy efficiency. We formulate the problem as a mixed-integer nonlinear program (MINLP) integrating multi-dimensional resource scheduling. Our solution combines model-driven optimization with a lightweight data-driven component, unifying dynamic voltage and frequency scaling (DVFS), memory frequency scaling, and edge offloading. Experiments across local and collaborative inference scenarios demonstrate that our method reduces energy consumption by 32.7% and latency by 24.5% on average over baseline approaches, significantly improving the energy-delay trade-off for edge AI inference.
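The joint tuning described above can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's actual cost models or constants: local latency/energy are simple functions of compute and memory frequency, offloading uses a Shannon-rate upload model, and a small exhaustive search over discrete frequency and power grids stands in for the MINLP solver, picking the minimum-energy configuration that meets a latency deadline.

```python
import itertools
import math

# Toy constants -- illustrative assumptions only, not values from the paper.
WORK_CYCLES = 2e9     # compute cycles required by the DNN inference task
MEM_ACCESSES = 5e8    # memory accesses (memory-bound portion of latency)
KAPPA = 1e-27         # switched-capacitance-like constant (E ~ kappa*cycles*f^2)
MEM_ENERGY = 2e-10    # energy per memory access (J), assumed frequency-independent
DATA_BITS = 8e6       # bits uploaded when offloading
BANDWIDTH = 1e7       # channel bandwidth (Hz)
NOISE = 1e-9          # noise power (W)
EDGE_LATENCY = 0.05   # fixed edge-side inference time (s), assumed

def local_cost(f_comp, f_mem):
    """Latency (s) and energy (J) of on-device inference at the given
    compute and memory frequencies (Hz), under the toy model above."""
    latency = WORK_CYCLES / f_comp + MEM_ACCESSES / f_mem
    energy = KAPPA * WORK_CYCLES * f_comp**2 + MEM_ENERGY * MEM_ACCESSES
    return latency, energy

def offload_cost(p_tx):
    """Latency (s) and energy (J) of offloading the task at transmit
    power p_tx (W), using a Shannon-rate channel model."""
    rate = BANDWIDTH * math.log2(1 + p_tx / NOISE)   # bits per second
    t_up = DATA_BITS / rate
    return t_up + EDGE_LATENCY, p_tx * t_up

def best_config(deadline):
    """Exhaustively search discrete frequency/power grids for the
    minimum-energy configuration meeting the latency deadline.
    Returns (decision, energy, latency) or None if infeasible."""
    comp_freqs = [0.8e9, 1.2e9, 1.6e9, 2.0e9]
    mem_freqs = [0.8e9, 1.6e9, 2.1e9]
    tx_powers = [0.1, 0.2, 0.5, 1.0]
    best = None
    for f_c, f_m in itertools.product(comp_freqs, mem_freqs):
        lat, en = local_cost(f_c, f_m)
        if lat <= deadline and (best is None or en < best[1]):
            best = (("local", f_c, f_m), en, lat)
    for p in tx_powers:
        lat, en = offload_cost(p)
        if lat <= deadline and (best is None or en < best[1]):
            best = (("offload", p), en, lat)
    return best

print(best_config(2.0))
```

With these (assumed) numbers, offloading at low transmit power dominates local execution on energy; tightening the deadline or inflating the upload size shifts the decision back toward high-frequency local inference, which is the trade-off the framework navigates.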
📝 Abstract
Deep neural networks (DNNs) are widely applied in diverse applications, but they inevitably incur high latency and energy overhead on resource-constrained devices. To address this challenge, most prior work relies on the dynamic voltage and frequency scaling (DVFS) technique to balance latency and energy consumption by adjusting the computing frequency of processors. However, memory frequency, which also plays a significant role in inference time and energy consumption, is usually ignored and not fully exploited for efficient DNN inference. In this paper, we first investigate the impact of jointly scaling memory frequency and computing frequency on inference time and energy consumption using a combined model-based and data-driven method. Then, using the fitted parameters of different DNN models, we give a preliminary analysis of the proposed model to examine the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in both local inference and cooperative inference cases further validate the effectiveness of jointly scaling memory frequency and computing frequency to reduce device energy consumption.
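The "model-based and data-driven" step can be sketched as follows. This is a minimal toy example, assuming a latency model of the form t = a/f_comp + b/f_mem (the model form, parameter names, and all numbers are my assumptions, not the paper's): given profiled (frequency, latency) samples, the coefficients are fitted by linear least squares on the 2x2 normal equations, since the model is linear in 1/f_comp and 1/f_mem.

```python
# Toy sketch: fit t = a/f_comp + b/f_mem to profiled latency samples.
# Model form and all constants are illustrative assumptions.

def fit_latency_model(samples):
    """samples: list of (f_comp, f_mem, measured_latency).
    Solves min_{a,b} sum (a*x + b*y - t)^2 with x=1/f_comp, y=1/f_mem
    via the 2x2 normal equations."""
    sxx = sxy = syy = sxt = syt = 0.0
    for f_c, f_m, t in samples:
        x, y = 1.0 / f_c, 1.0 / f_m
        sxx += x * x; sxy += x * y; syy += y * y
        sxt += x * t; syt += y * t
    det = sxx * syy - sxy * sxy
    a = (sxt * syy - syt * sxy) / det
    b = (syt * sxx - sxt * sxy) / det
    return a, b

# Synthetic noise-free "profiling" data generated with a=2e9, b=5e8,
# so the fit should recover those coefficients.
samples = [(f_c, f_m, 2e9 / f_c + 5e8 / f_m)
           for f_c in (0.8e9, 1.2e9, 2.0e9)
           for f_m in (0.8e9, 1.6e9, 2.1e9)]
a, b = fit_latency_model(samples)
```

Once a and b are fitted per DNN model, the latency (and an analogous energy model) can be evaluated at any candidate (f_comp, f_mem) pair, which is what makes the joint frequency analysis in the abstract tractable without re-profiling every configuration.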