Evaluating Multi-Instance DNN Inferencing on Multiple Accelerators of an Edge Device

📅 2024-12-18
🏛️ 2024 IEEE 31st International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW)
📈 Citations: 0
Influential: 0
🤖 AI Summary
Optimizing multi-accelerator orchestration—across GPU CUDA cores, Tensor Cores, and the Deep Learning Accelerator (DLA)—for ResNet50 multi-instance inference on resource-constrained NVIDIA Jetson AGX Orin edge devices remains challenging due to heterogeneous hardware constraints and inter-accelerator contention. Method: We conduct systematic empirical measurements across diverse accelerator combinations and batch sizes, quantifying throughput–latency trade-offs under realistic edge deployment conditions. Contribution/Results: Our analysis reveals that CUDA core + Tensor Core collaboration achieves peak throughput, whereas integrating the DLA degrades overall performance due to memory bandwidth saturation and instruction-scheduling conflicts. We propose a hardware-aware cooperative scheduling framework grounded in measured architectural characteristics. This framework provides empirically validated insights and actionable design guidelines for heterogeneous accelerator resource allocation and runtime scheduling in edge AI platforms, bridging the gap between theoretical acceleration potential and practical system-level efficiency.
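For context on how this kind of accelerator targeting is typically configured, the sketch below builds TensorRT engines pinned either to the GPU (CUDA and Tensor Cores) or to a DLA core. It is a minimal sketch under assumed settings, not the authors' harness: the ONNX file name, FP16 precision, and GPU-fallback flag are illustrative choices.

```python
# Minimal sketch (not the paper's code): building TensorRT engines that
# target either the GPU or a DLA core on a Jetson-class device.
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, use_dla: bool, dla_core: int = 0) -> trt.IHostMemory:
    """Parse an ONNX model and return a serialized TensorRT engine."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # the DLA requires FP16 or INT8
    if use_dla:
        # Pin layers to the DLA, letting unsupported layers fall back to the GPU.
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = dla_core
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    return builder.build_serialized_network(network, config)

# One engine per accelerator, e.g. for a GPU instance and a DLA instance:
# gpu_plan = build_engine("resnet50.onnx", use_dla=False)
# dla_plan = build_engine("resnet50.onnx", use_dla=True, dla_core=0)
```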

📝 Abstract
Edge devices like the Nvidia Jetson now carry multiple on-board accelerators, such as GPU CUDA cores, Tensor Cores and Deep Learning Accelerators (DLAs). Maximizing the DNN inferencing performance of such devices requires using these co-located hardware components concurrently, but this has not yet been studied. We analyze the performance of the accelerators present in the Jetson AGX Orin, both independently and concurrently, using multiple instances of the ResNet50 model. We assess the effects of different combinations of the components and varying batch sizes on inference throughput and latency. Our results indicate that using CUDA cores together with Tensor Cores offers higher throughput, while using them in conjunction with the DLAs reduces the benefits. This paves the way to explore more intelligent configurations that maximize the performance of edge platforms for AI workloads.
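To make "concurrently use these co-located hardware components" concrete, one common pattern launches each model instance from its own host thread on its own CUDA stream. The sketch below shows that pattern for two GPU-resident ResNet50 instances in PyTorch; it is an illustrative analog rather than the paper's setup (a DLA-resident instance would instead run through a TensorRT engine, as sketched above), and the batch size and iteration count are placeholders.

```python
# Sketch: two concurrent ResNet50 instances, each on its own CUDA stream.
import threading
import torch
import torchvision.models as models

def run_instance(name: str, stream: torch.cuda.Stream, iters: int = 20) -> None:
    # Each instance owns its model replica and input batch (random placeholder data).
    model = models.resnet50(weights=None).eval().cuda().half()
    x = torch.randn(8, 3, 224, 224, device="cuda", dtype=torch.half)
    with torch.cuda.stream(stream), torch.no_grad():
        for _ in range(iters):
            model(x)
    stream.synchronize()
    print(f"{name} finished {iters} batches")

streams = [torch.cuda.Stream() for _ in range(2)]
threads = [threading.Thread(target=run_instance, args=(f"instance-{i}", s))
           for i, s in enumerate(streams)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```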
Problem

Research questions and friction points this paper is trying to address.

Evaluate multi-instance DNN inferencing on edge device accelerators.
Analyze throughput and latency with varying batch sizes and hardware combinations.
Explore intelligent scheduling to optimize resource utilization on edge platforms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concurrent DNN inferencing on multiple accelerators
Performance analysis with varying batch sizes (see the sweep sketch after this list)
Intelligent scheduling for resource optimization
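As a minimal sketch of the batch-size sweep such an analysis involves, the following times ResNet50 at several batch sizes on a CUDA device and reports throughput and per-batch latency. The batch sizes, warm-up count, and iteration count are illustrative assumptions, not the paper's measurement protocol.

```python
# Sketch: throughput/latency sweep over batch sizes for ResNet50 (FP16, GPU).
import time
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval().cuda().half()
iters = 50

for batch in (1, 2, 4, 8, 16, 32):
    x = torch.randn(batch, 3, 224, 224, device="cuda", dtype=torch.half)
    with torch.no_grad():
        for _ in range(10):        # warm-up to exclude one-time setup costs
            model(x)
        torch.cuda.synchronize()   # drain queued kernels before starting the clock
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()   # wait for all timed kernels to finish
    dt = time.perf_counter() - t0
    print(f"batch={batch:3d}  throughput={batch * iters / dt:8.1f} img/s  "
          f"latency={dt / iters * 1e3:6.2f} ms/batch")
```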
Mumuksh Tayal
Indian Institute of Science, Bangalore
Offline RL · Imitation Learning · Safe Control · Hardware Aware Algorithms
Yogesh L. Simmhan
Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India