AI Summary
This work addresses the inefficiency of existing GPU schedulers that treat LLM invocations within AI agent tasks as independent requests, discarding intermediate state such as KV caches between steps and thereby significantly increasing end-to-end latency. To overcome this limitation, the paper introduces a program-level scheduling paradigm in which the entire agent workflow, rather than each individual inference call, is the schedulable unit. By modeling execution dependencies through an Agent Execution Graph, the proposed system (SAGA) predicts KV cache reuse across tool-call boundaries, and it combines session-affinity batching with work stealing and the Agent Fair Share fairness metric to improve resource efficiency while guaranteeing bounded fairness deviation. Evaluated on a 64-GPU cluster, the approach reduces task completion time by 1.64× and improves GPU memory utilization by 1.22× compared to vLLM v0.15.1 with prefix caching and affinity routing, achieves 99.2% SLO attainment, and comes within 1.31× of Bélády's optimal offline caching policy, at the cost of approximately 30% lower peak throughput than throughput-optimal batch scheduling.
Abstract
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
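The abstract measures cache behavior against Bélády's optimal offline policy, which evicts the entry whose next use lies farthest in the future. The sketch below is only an illustration of that baseline, not SAGA's implementation: it counts misses for Bélády's policy and for plain LRU on a toy trace of session identifiers. The function names, the trace, and the simplification that every session's KV cache occupies one equal-sized slot are all assumptions made for this example.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Misses under Belady's offline policy (assumes capacity >= 1)."""
    # For each position, record the index of the next access to the same key.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of that key's next use
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the entry whose next use is farthest away.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Misses under least-recently-used eviction (assumes capacity >= 1)."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used entry
        cache[key] = True
    return misses


# Hypothetical trace: three agent sessions taking turns, room for two caches.
trace = ["A", "B", "C", "A", "B", "C", "A", "B", "C"]
optimal = belady_misses(trace, capacity=2)
lru = lru_misses(trace, capacity=2)
print(optimal, lru, lru / optimal)
```

On this round-robin trace LRU misses on every access, while the offline policy keeps one session resident across turns. A ratio of this kind is one way to read a "within 1.31x of optimal" figure, although the abstract does not state which quantity the paper's ratio is computed over.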