AI Summary
This work addresses the scheduling challenges of agentic large language model inference in heterogeneous GPU resource pools by proposing GoodServe, a system designed to meet end-to-end latency constraints while maximizing goodput. GoodServe combines output length prediction, GPU load state modeling, and a runtime request migration mechanism. It employs a "just-enough instance selection" heuristic to make routing decisions that account for both demand and supply, and it monitors the risk of SLO violations to trigger timely migrations. Experimental results show that GoodServe improves goodput by up to 27.4% over existing approaches.
Abstract
Large Language Models (LLMs) play a critical role in emerging agentic applications, where the timely completion of each entire inference is essential. Meanwhile, agentic LLM inferences are increasingly served on heterogeneous GPUs in operators' resource pools. Therefore, it is crucial to route incoming inference requests to appropriate GPUs so that their end-to-end latency requirements are satisfied whenever possible, thereby achieving high goodput. In this paper, we propose GoodServe, a goodput-optimized serving system for agentic inferences over heterogeneous resources. GoodServe performs inference routing in a predict-and-rectify manner. It estimates request output lengths and GPU serving status in an accurate yet practical manner. Based on information from both the demand and resource sides, it then makes high-quality routing decisions using a just-enough instance selection heuristic. It also periodically monitors the SLO-violation risks of active requests and triggers runtime request migrations to address unexpected dynamics. Our evaluations show that GoodServe improves goodput by up to 27.4% over existing routing methods.
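The abstract does not spell out the routing rule, so the following is only an illustrative sketch of one plausible reading of "just-enough instance selection" and of SLO-risk monitoring, not GoodServe's actual algorithm. It assumes a simple linear latency model (queueing delay plus predicted output tokens times per-token decode time). All names (`Instance`, `Request`, `select_just_enough`, `at_risk`) and fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Instance:
    name: str
    queue_delay_s: float    # assumed: estimated wait before the request starts
    secs_per_token: float   # assumed: estimated decode time per output token


@dataclass
class Request:
    predicted_output_tokens: int  # assumed to come from an output-length predictor
    slo_s: float                  # end-to-end latency budget in seconds


def estimated_latency(req: Request, inst: Instance) -> float:
    """Linear latency estimate (an assumption, not the paper's model)."""
    return inst.queue_delay_s + req.predicted_output_tokens * inst.secs_per_token


def select_just_enough(req: Request, instances: Sequence[Instance]) -> Optional[Instance]:
    """Pick the feasible instance with the least slack, keeping faster ones free."""
    feasible = [i for i in instances if estimated_latency(req, i) <= req.slo_s]
    if not feasible:
        return None  # no instance is predicted to meet the SLO
    return max(feasible, key=lambda i: estimated_latency(req, i))


def at_risk(req: Request, inst: Instance, elapsed_s: float, tokens_done: int) -> bool:
    """Flag a running request whose projected finish time exceeds its SLO."""
    remaining = max(req.predicted_output_tokens - tokens_done, 0)
    return elapsed_s + remaining * inst.secs_per_token > req.slo_s


fast = Instance("fast", queue_delay_s=0.5, secs_per_token=0.01)
slow = Instance("slow", queue_delay_s=0.5, secs_per_token=0.04)
loose = Request(predicted_output_tokens=100, slo_s=5.0)
tight = Request(predicted_output_tokens=100, slo_s=2.0)
```

Under these assumptions, `select_just_enough(loose, [fast, slow])` chooses the slower instance because it still meets the 5-second budget, while `tight` can only be placed on the faster one. A request flagged by `at_risk` would be a candidate for migration to a faster instance.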