AI Summary
This work addresses the challenge of jointly optimizing end-to-end latency, accuracy, and GPU resource cost in compound inference systems composed of multiple machine learning models, a joint optimization that existing approaches struggle to perform effectively. To this end, we propose JigsawServe, the first framework capable of such joint optimization. JigsawServe combines task-graph-aware fine-grained spatial partitioning of GPU resources, adaptive selection of model variants, and SLO-aware scheduling to allocate resources efficiently while meeting both accuracy and latency service-level objectives (SLOs). Experimental results show that, compared to the state-of-the-art baseline, JigsawServe improves request throughput by up to 11.3x, reduces GPU resource consumption to just 43.3% of the baseline, and keeps the latency SLO violation rate below 0.6%.
Abstract
Applications in emerging domains such as XR are being built as compound inference systems, where multiple ML models are composed in the form of a task graph to service each request. Serving these compound systems efficiently raises two questions: how to apportion end-to-end latency and accuracy budgets among the different tasks of a compound inference system, and how to allocate resources effectively for different models with varying resource requirements. We present JigsawServe, the first serving framework that jointly optimizes latency, accuracy, and GPU resource cost by adaptively choosing model variants and performing fine-grained resource allocation that spatially partitions the GPUs for each task of a compound inference system. Analytical evaluation of a system with a large number of GPUs shows that JigsawServe can increase the maximum serviceable demand (in requests per second) by 11.3x compared to the closest prior work. Our empirical evaluation shows that, across a large range of scenarios, JigsawServe consumes only 43.3% of the available GPU resources while meeting accuracy SLOs with fewer than 0.6% latency SLO violations. All of the features in JigsawServe contribute to this high efficiency: sacrificing any one of accuracy scaling, GPU spatial partitioning, or task-graph-informed resource budgeting significantly reduces it.
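To illustrate the kind of joint decision the abstract describes, here is a deliberately simplified brute-force sketch: for a two-task graph it picks a model variant and a spatial GPU partition size per task, minimizing GPU cost subject to end-to-end latency and accuracy SLOs. All variant names, accuracy/latency numbers, and the inverse-scaling latency model are hypothetical illustrations, not JigsawServe's actual planner or data from the paper.

```python
from itertools import product

# Hypothetical model variants for a two-stage task graph (detect -> classify).
# Each tuple is (accuracy, latency in ms on one GPU slice); latency is assumed
# to scale inversely with the number of spatial slices a task receives.
VARIANTS = {
    "detect":   [(0.90, 40.0), (0.95, 80.0)],
    "classify": [(0.88, 20.0), (0.96, 50.0)],
}
SLICES = [1, 2, 4]      # candidate spatial partition sizes per task
LAT_SLO_MS = 60.0       # end-to-end latency budget
ACC_SLO = 0.85          # end-to-end accuracy floor (product of stage accuracies)

def plan():
    """Return the cheapest feasible plan as
    (total_slices, (detect_acc, detect_slices), (classify_acc, classify_slices),
     latency_ms, accuracy), or None if no combination meets both SLOs."""
    best = None
    for (d_acc, d_lat), (c_acc, c_lat) in product(VARIANTS["detect"],
                                                  VARIANTS["classify"]):
        for ds, cs in product(SLICES, SLICES):
            latency = d_lat / ds + c_lat / cs   # tasks run sequentially
            accuracy = d_acc * c_acc
            cost = ds + cs                      # total GPU slices consumed
            if latency <= LAT_SLO_MS and accuracy >= ACC_SLO:
                if best is None or cost < best[0]:
                    best = (cost, (d_acc, ds), (c_acc, cs), latency, accuracy)
    return best

print(plan())
```

With these made-up numbers the cheapest feasible plan pairs the low-accuracy detector with the high-accuracy classifier on two slices each: neither the cheapest nor the most accurate variant everywhere, which is the trade-off a joint optimizer can find but a per-model scheduler cannot.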