AI Summary
This work addresses the challenge of concurrently serving multiple deep neural network (DNN) models on a shared GPU at the edge while meeting tail-latency and service-level objective (SLO) requirements. The authors propose a scheduling framework that integrates time-sliced GPU sharing with early-exit inference, jointly selecting the model, exit point, and batch size at runtime to minimize system-wide SLO violations. A key innovation is a stability score that quantifies how each candidate scheduling decision affects future queue states. Time-sliced execution keeps latency predictable, and early exits expand the feasible action space under tight latency constraints. Experiments on multiple hardware platforms show consistent improvements over representative baselines in both SLO violation ratio and P95 latency.
Abstract
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario in which each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference to keep inference latency predictable, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue state. At runtime, it jointly selects the model, exit point, and batch size to minimize the predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency. This gain is enabled by the early-exit mechanism, which expands the scheduling action space under tight latency constraints.
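The joint selection described above can be illustrated with a small sketch. It is not EdgeServing's implementation: the abstract does not define the stability score, so the code substitutes a hypothetical proxy, the number of queued requests across all models predicted to miss their deadline once a candidate action has occupied the GPU. The names `Queue`, `predicted_misses`, and `select_action`, the profiled latency table, and the tie-break toward deeper exits are all assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Queue:
    """One model's request queue (hypothetical structure, not from the paper)."""
    model: str
    deadlines: list[float]  # absolute deadlines in ms, oldest request first
    latency_ms: dict[tuple[int, int], float]  # (exit_point, batch_size) -> profiled latency


def predicted_misses(queues, now, chosen, exit_point, batch):
    """Assumed stand-in for a stability score: requests in ALL queues that are
    predicted to miss their deadline if `chosen` runs (exit_point, batch) next."""
    finish = now + chosen.latency_ms[(exit_point, batch)]
    misses = 0
    for q in queues:
        if q is chosen:
            served, waiting = q.deadlines[:batch], q.deadlines[batch:]
            misses += sum(d < finish for d in served)
        else:
            waiting = q.deadlines
        # Under time-division sharing, waiting requests cannot start before
        # `finish`; optimistically assume they then run at their fastest setting.
        fastest = min(q.latency_ms.values())
        misses += sum(d < finish + fastest for d in waiting)
    return misses


def select_action(queues, now):
    """Enumerate (model, exit point, batch size) and return the action with the
    fewest predicted misses, preferring deeper exits on ties."""
    best = None
    for q in queues:
        for (exit_point, batch) in q.latency_ms:
            if batch > len(q.deadlines):
                continue  # not enough queued requests to fill this batch
            key = (predicted_misses(queues, now, q, exit_point, batch), -exit_point)
            if best is None or key < best[0]:
                best = (key, q.model, exit_point, batch)
    return None if best is None else best[1:]
```

Because only one inference runs at a time, every candidate delays every other queue by its full latency, which is why the score is computed over all queues rather than only the one being served. In this sketch, a shallower exit becomes attractive exactly when the full-depth latency would push another queue's request past its deadline.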