🤖 AI Summary
To address the limited inference capability of resource-constrained edge devices, this paper systematically compares hierarchical inference (HI) against pure on-device inference across accuracy, latency, and energy consumption. Existing HI studies overlook device-side latency and energy overhead and fail to jointly model the heterogeneous hardware, network, and model dimensions. To bridge this gap, we conduct a multi-dimensional empirical evaluation of HI on real embedded hardware. We propose Early Exit with HI (EE-HI), a hybrid mechanism that dynamically coordinates a lightweight local model with a remote server: it adaptively offloads samples and enables early exit during image classification, thereby reducing HI’s inherent fixed overhead. Experiments show that HI reduces latency by up to 73% and device energy consumption by up to 77% compared to pure on-device inference; EE-HI further improves on HI, reducing latency by up to 59.7% and device energy consumption by up to 60.4%.
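A minimal sketch of the HI decision logic summarized above, assuming a softmax-confidence threshold as the offload rule (the exact criterion used in the paper is not stated here); `local_model`, `remote_infer`, and `OFFLOAD_THRESHOLD` are illustrative names, not the paper's API:

```python
import numpy as np

# Hypothetical confidence cutoff; a softmax-confidence threshold stands in
# for whatever offload rule the paper actually uses.
OFFLOAD_THRESHOLD = 0.8

def hi_classify(x, local_model, remote_infer):
    """Hierarchical Inference: every sample first runs through the on-device
    model; only samples the local model is unsure about are offloaded."""
    probs = local_model(x)                # on-device softmax output
    if float(np.max(probs)) >= OFFLOAD_THRESHOLD:
        return int(np.argmax(probs))      # accept the local prediction
    return remote_infer(x)                # offload the hard sample to the server

# Example usage with stub models:
if __name__ == "__main__":
    local_model = lambda x: np.array([0.10, 0.85, 0.05])  # confident local output
    remote_infer = lambda x: 2                            # server-side prediction
    print(hi_classify(np.zeros(3), local_model, remote_infer))  # -> 1 (stays local)
```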
📝 Abstract
On-device inference holds great potential for increased energy efficiency, responsiveness, and privacy in edge ML systems. However, because only less capable ML models can be embedded in resource-limited devices, use cases are limited to simple inference tasks such as visual keyword spotting, gesture recognition, and predictive analytics. In this context, Hierarchical Inference (HI) has emerged as a promising solution that augments the capabilities of the local ML model by offloading selected samples to an edge server or the cloud for remote inference. Existing works demonstrate through simulation that HI improves accuracy, but they do not account for the latency and energy consumption on the device, nor do they consider three key heterogeneous dimensions that characterize ML systems: hardware, network connectivity, and models. In contrast, this paper systematically compares the performance of HI with that of on-device inference, based on measurements of accuracy, latency, and energy when running embedded ML models on five devices with different capabilities and three image classification datasets. For a given accuracy requirement, the HI systems we designed achieved up to 73% lower latency and up to 77% lower device energy consumption than an on-device inference system. The key to building an efficient HI system is the availability of small, reasonably accurate on-device models whose outputs effectively distinguish the samples that require remote inference. Despite these performance gains, HI still requires on-device inference for every sample, which adds a fixed overhead to its latency and energy consumption. We therefore design a hybrid system, Early Exit with HI (EE-HI), and demonstrate that, compared to HI, EE-HI reduces latency by up to 59.7% and lowers the device’s energy consumption by up to 60.4%.
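To make the EE-HI idea concrete, here is a hedged sketch of how an early-exit head could be combined with HI offloading, again assuming confidence-threshold decisions; `early_stage`, `late_stage`, and both thresholds are hypothetical names, and the paper's actual exit placement and policy may differ:

```python
import numpy as np

# Hypothetical thresholds; the real EE-HI policy and exit placement are
# design choices not specified in the abstract.
EXIT_THRESHOLD = 0.9
OFFLOAD_THRESHOLD = 0.8

def ee_hi_classify(x, early_stage, late_stage, remote_infer):
    """EE-HI: an early-exit head may accept easy samples partway through the
    local model, trimming HI's fixed cost of full on-device inference; hard
    samples still run the full local model and, if needed, go remote."""
    feats, early_probs = early_stage(x)       # run up to the early-exit branch
    if float(np.max(early_probs)) >= EXIT_THRESHOLD:
        return int(np.argmax(early_probs))    # exit early: skip remaining layers
    final_probs = late_stage(feats)           # finish on-device inference
    if float(np.max(final_probs)) >= OFFLOAD_THRESHOLD:
        return int(np.argmax(final_probs))    # accept the full local prediction
    return remote_infer(x)                    # offload to the edge server
```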