🤖 AI Summary
This work addresses the challenge of efficiently adapting heterogeneous accelerators for multi-DNN inference in edge computing systems, where existing approaches suffer from high Service Level Objective (SLO) violation rates. The authors propose SparseLoom, a system that introduces, for the first time, a retraining-free model stitching mechanism. By dynamically recombining sparse subgraphs, SparseLoom generates hardware-adaptive model variants on the fly and integrates heterogeneous task scheduling with memory optimization to enable efficient collaborative inference on edge SoCs. Experimental results show that SparseLoom reduces SLO violations by up to 74%, improves throughput by up to 2.31×, and lowers average memory overhead by 28% compared to state-of-the-art solutions.
📝 Abstract
Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance both from concurrent execution and from matching each model to the best-suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes this matching and results in high Service Level Objective (SLO) violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without retraining. We present a demonstrator system, SparseLoom, which shows that model stitching can be deployed on SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31×, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
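To make the stitching idea concrete, here is a minimal toy sketch of how one might select per-block sparse variants under a latency budget. This is not SparseLoom's actual algorithm; the `Variant` fields, block names, and numbers are all illustrative assumptions, and the greedy selection stands in for whatever search the real system performs.

```python
from dataclasses import dataclass

# Toy illustration of model stitching: each model is split into blocks,
# and every block has several pre-pruned sparse variants that share the
# same block interface, so they can be recombined without retraining.
# A "stitched" model picks one variant per block so total latency fits
# the target accelerator's budget.

@dataclass(frozen=True)
class Variant:
    name: str         # hypothetical identifier, e.g. "b1-sparse"
    latency_ms: float # measured latency of this block variant (made up)
    accuracy: float   # proxy quality score of this variant (made up)

def stitch(blocks, latency_budget_ms):
    """Greedy stitcher: per block, take the most accurate variant that
    still leaves enough budget for the cheapest variants of later blocks."""
    chosen = []
    remaining = latency_budget_ms
    for i, variants in enumerate(blocks):
        # Minimum possible cost of all blocks after this one.
        tail_min = sum(min(v.latency_ms for v in vs) for vs in blocks[i + 1:])
        feasible = [v for v in variants if v.latency_ms + tail_min <= remaining]
        if not feasible:  # budget too tight: fall back to the cheapest variant
            feasible = [min(variants, key=lambda v: v.latency_ms)]
        best = max(feasible, key=lambda v: v.accuracy)
        chosen.append(best)
        remaining -= best.latency_ms
    return chosen

blocks = [
    [Variant("b1-dense", 4.0, 0.95), Variant("b1-sparse", 2.0, 0.90)],
    [Variant("b2-dense", 5.0, 0.93), Variant("b2-sparse", 2.5, 0.88)],
]
plan = stitch(blocks, latency_budget_ms=7.0)
print([v.name for v in plan])  # → ['b1-dense', 'b2-sparse']
```

Under a 7 ms budget the sketch keeps the dense first block but swaps in the sparse second block, mirroring how a stitched variant can be tailored to a specific accelerator's speed without touching model weights.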