🤖 AI Summary
This work addresses the challenge of end-to-end latency guarantees for online agent workflows in heterogeneous LLM serving clusters that decouple prefill and decode stages. It proposes the first scheduling framework tailored to such workflows, modeling each request as a runtime-revealed DAG. The framework predicts per-workflow independent completion times and prioritizes ready nodes based on their risk of missing service-level objectives (SLOs). It jointly optimizes placement across heterogeneous GPUs and queue priorities for both prefill and decode phases, while explicitly accounting for KV cache capacity constraints and cross-stage communication overhead. Experiments on A100/H100/H200 clusters demonstrate that, compared to baselines, the proposed approach reduces the average resource requirements to meet 95% and 99% workflow SLOs by 20.1% and 33.0%, respectively, with peak reductions reaching 45.0% and 80.5%.
📝 Abstract
Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency. Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively.