🤖 AI Summary
In edge computing environments with concurrent multi-task inference—such as object detection, semantic segmentation, and depth estimation—joint optimization of model deployment (loading) and request routing (offloading) across terminal–edge–cloud tiers remains challenging under memory, computation, and communication constraints.
Method: This paper proposes J3O, a joint onloading–offloading optimization framework that integrates batch-aware multi-task model deployment (onloading) with request routing (offloading). It employs an alternating algorithm that combines Lagrangian-relaxed submodular optimization with constrained linear programming, explicitly modeling hardware and network constraints within a mixed-integer program and incorporating edge-side batch processing for scalability.
Contribution/Results: J3O achieves at least 97% of optimal accuracy on multi-task benchmarks while requiring less than 15% of the runtime of an exact solver. It substantially improves inference efficiency and resource utilization, and is, to the authors' knowledge, the first work to jointly optimize batch-aware multi-task onloading and offloading in hierarchical edge computing systems.
📝 Abstract
The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over 97% of the optimal accuracy while incurring less than 15% of the runtime required by the optimal solver across multi-task benchmarks.
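To make the two-step structure concrete, here is a minimal, hypothetical sketch of an alternating onload/offload loop. All names, model sizes, and accuracies are invented for illustration; the onloading step uses a simple accuracy-per-memory greedy (a stand-in for the paper's Lagrangian-relaxed submodular selection), and the routing step uses a greedy fill (a stand-in for the constrained linear program).

```python
# Toy problem data (hypothetical): model -> (memory cost, accuracy per unit load)
MODELS = {
    "det-small": (2, 0.70),
    "det-large": (5, 0.85),
    "seg":       (3, 0.80),
    "depth":     (4, 0.75),
}

def onload(budget):
    """Greedy onloading: repeatedly pick the model with the best
    accuracy-gain-per-unit-memory ratio that still fits the budget
    (a crude proxy for Lagrangian-relaxed submodular selection)."""
    chosen, used = [], 0
    remaining = dict(MODELS)
    while remaining:
        best = max(remaining, key=lambda m: remaining[m][1] / remaining[m][0])
        mem, _ = remaining.pop(best)
        if used + mem <= budget:
            chosen.append(best)
            used += mem
    return chosen

def route(chosen, load, capacity):
    """Greedy routing stand-in for the LP step: send query load to the
    most accurate onloaded models first, up to a per-model capacity;
    returns the routing plan and total accuracy-weighted throughput."""
    plan, total_acc = {}, 0.0
    for m in sorted(chosen, key=lambda m: -MODELS[m][1]):
        share = min(load, capacity)
        plan[m] = share
        total_acc += share * MODELS[m][1]
        load -= share
        if load <= 0:
            break
    return plan, total_acc

chosen = onload(budget=9)               # e.g. ['det-small', 'seg', 'depth']
plan, acc = route(chosen, load=10, capacity=6)
```

In the actual framework the routing subproblem is an LP over fractional offloading decisions and the onloading step is re-solved with updated multipliers until the alternation converges; this sketch only shows the shape of the loop, not the paper's algorithm.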