GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

📅 2026-05-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the scheduling challenges of agentic large language model inference in heterogeneous GPU resource pools by proposing GoodServe, a system designed to meet end-to-end latency constraints while maximizing goodput. GoodServe combines output length prediction, GPU load state modeling, and a runtime request migration mechanism. It employs a “just-enough instance selection” heuristic to make routing decisions that account for both demand and supply, and it monitors the risk of SLO violations to trigger timely migrations. Experimental results show that GoodServe improves goodput by up to 27.4% over existing routing methods.
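The summary does not spell out the "just-enough instance selection" heuristic, so the following is only a minimal sketch of one plausible reading: among the instances predicted to finish within the SLO, pick the one with the least slack, leaving faster GPUs free for tighter requests. The `Instance` fields, the linear latency model, and the function names are all assumptions made for illustration, not GoodServe's actual design.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Instance:
    # All fields are hypothetical load/speed estimates for one GPU instance.
    name: str
    queue_delay_s: float       # expected wait before the request starts
    prefill_s: float           # expected time to process the prompt
    seconds_per_token: float   # expected decode time per output token


def predicted_latency(inst: Instance, predicted_output_tokens: int) -> float:
    """Assumed linear model: queueing + prefill + per-token decode time."""
    return (inst.queue_delay_s + inst.prefill_s
            + inst.seconds_per_token * predicted_output_tokens)


def select_just_enough(instances: Sequence[Instance],
                       predicted_output_tokens: int,
                       slo_s: float) -> Optional[Instance]:
    """Return the feasible instance with the least slack, or None if none fits."""
    feasible = []
    for inst in instances:
        latency = predicted_latency(inst, predicted_output_tokens)
        if latency <= slo_s:
            feasible.append((latency, inst))
    if not feasible:
        return None  # the caller decides what to do with unservable requests
    # Largest predicted latency that still meets the SLO = "just enough".
    return max(feasible, key=lambda pair: pair[0])[1]


pool = [
    Instance("fast", queue_delay_s=0.1, prefill_s=0.2, seconds_per_token=0.01),
    Instance("mid", queue_delay_s=0.1, prefill_s=0.4, seconds_per_token=0.03),
    Instance("slow", queue_delay_s=0.2, prefill_s=0.8, seconds_per_token=0.08),
]
choice = select_just_enough(pool, predicted_output_tokens=100, slo_s=5.0)
print(choice.name if choice else "no feasible instance")
```

With a 5-second SLO and 100 predicted output tokens, the slow instance is predicted to miss the deadline, so the sketch picks the mid-tier instance rather than the fastest one.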
๐Ÿ“ Abstract
Large Language Models (LLMs) play a central role in emerging agentic applications, where the timely completion of each entire inference is critical. Meanwhile, agentic LLM inferences are increasingly served on heterogeneous GPUs in operators' resource pools. Therefore, it is crucial to route incoming inference requests to appropriate GPUs so that their end-to-end latency requirements are satisfied whenever possible, thereby achieving high goodput. In this paper, we propose GoodServe, a goodput-optimized serving system for agentic inferences over heterogeneous resources. GoodServe performs inference routing in a predict-and-rectify manner. It estimates request output lengths and GPU serving status in a way that is both accurate and practical. Based on information from both the demand and resource sides, it then makes high-quality routing decisions using a just-enough instance selection heuristic. It also periodically monitors the SLO-violation risks of active requests and triggers runtime request migrations to address unexpected dynamics. Our evaluations show that GoodServe improves goodput by up to 27.4% over existing routing methods.
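The "rectify" half of the predict-and-rectify loop, periodic SLO-risk monitoring that triggers migration, could look roughly like the sketch below. It assumes a simple projection (elapsed time plus remaining predicted tokens times the current per-token decode time) and a hypothetical `safety_margin` knob; the abstract does not describe the actual risk model or migration policy, so none of these names come from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ActiveRequest:
    # Hypothetical bookkeeping for a request that is currently decoding.
    request_id: str
    elapsed_s: float               # time since the request arrived
    tokens_generated: int          # output tokens produced so far
    predicted_output_tokens: int   # output-length estimate made at routing time
    slo_s: float                   # end-to-end latency requirement


def projected_finish_s(req: ActiveRequest, seconds_per_token: float) -> float:
    """Project total latency from progress so far and the current decode speed."""
    remaining = max(req.predicted_output_tokens - req.tokens_generated, 0)
    return req.elapsed_s + remaining * seconds_per_token


def requests_to_migrate(active: Sequence[ActiveRequest],
                        seconds_per_token: float,
                        safety_margin: float = 1.0) -> List[str]:
    """Return IDs of requests whose projected latency exceeds their SLO.

    safety_margin > 1.0 flags requests earlier (assumed tuning parameter).
    """
    return [
        req.request_id
        for req in active
        if projected_finish_s(req, seconds_per_token) * safety_margin > req.slo_s
    ]


active = [
    ActiveRequest("a", elapsed_s=2.0, tokens_generated=50,
                  predicted_output_tokens=200, slo_s=10.0),
    ActiveRequest("b", elapsed_s=6.0, tokens_generated=50,
                  predicted_output_tokens=200, slo_s=10.0),
]
at_risk = requests_to_migrate(active, seconds_per_token=0.04)
print(at_risk)
```

A real system would also weigh the cost of moving a request (for example, transferring or recomputing its cached state) against the expected gain, which this sketch ignores.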
Problem

Research questions and friction points this paper is trying to address.

agentic LLM inference
heterogeneous resources
goodput
latency requirements
request routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

goodput optimization
heterogeneous GPU serving
LLM inference routing
runtime request migration
SLO-aware scheduling