🤖 AI Summary
Static configuration selection for agentic workflows binds each request's agent-to-LLM assignment before execution, which yields suboptimal throughput and latency when system load shifts during the request's lifetime. This paper proposes Aragog, a system that adapts a request's configuration stage by stage at runtime. The method decouples configuration selection into two complementary components: (i) a one-time routing step that identifies the accuracy-preserving configurations for a request, and (ii) a lightweight per-stage scheduler that selects among them during execution using up-to-date system observations. To keep both steps efficient and accurate, the system integrates accuracy-preserving configuration pruning, real-time system-state awareness, and multi-stage workflow modeling. Evaluated on diverse workflows and model families, Aragog achieves 50.0%–217.0% higher maximum serving throughput and reduces median latency by 32.5%–78.9% at peak request rates, while maintaining accuracy comparable to the most expensive configurations.
📝 Abstract
Agentic workflows have emerged as a powerful paradigm for solving complex, multi-stage tasks, but serving them at scale is computationally expensive given the many LLM inferences that each request must pass through. Configuration selection, or the cost-aware assignment of workflow agents to specific LLMs, can reduce these costs, but existing approaches bind configuration decisions before request execution, making them ill-suited for the heterogeneous and lengthy execution of workflows. Specifically, system loads can fluctuate rapidly and substantially during a request's lifetime, causing fixed configurations to quickly become suboptimal. We present Aragog, a system that progressively adapts a request's configuration throughout its execution to match runtime dynamics. To make this practical despite the massive space of workflow configurations, Aragog decouples the problem into two core elements (a one-time routing step that identifies all accuracy-preserving configurations, and a cheap per-stage scheduler that selects among them using up-to-date system observations) and introduces novel strategies to accelerate each. Across diverse workflows and model families, Aragog increases maximum serving throughput by 50.0–217.0% and reduces median latency by 32.5–78.9% at peak request rates, while maintaining accuracy comparable to the most expensive configurations.
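The two-part decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not Aragog's implementation: the function names (`route_once`, `schedule_stage`, `run_request`), the accuracy-tolerance rule, the accuracy numbers, and the use of queue length as the load signal are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Config = Tuple[str, ...]  # hypothetical encoding: one model name per workflow stage


def route_once(accuracy: Dict[Config, float], tolerance: float) -> List[Config]:
    """One-time routing (assumed rule): keep every configuration whose
    estimated accuracy is within `tolerance` of the best configuration."""
    best = max(accuracy.values())
    return [cfg for cfg, acc in accuracy.items() if acc >= best - tolerance]


def schedule_stage(viable: List[Config], prefix: Config, load: Dict[str, int]) -> str:
    """Per-stage scheduling (assumed rule): among viable configurations that
    agree with the models already used (`prefix`), pick the least-loaded
    model for the next stage."""
    stage = len(prefix)
    candidates = {cfg[stage] for cfg in viable if cfg[:stage] == prefix}
    if not candidates:
        raise ValueError("no viable configuration matches the executed prefix")
    # Break ties on the model name so the choice is deterministic.
    return min(candidates, key=lambda m: (load.get(m, 0), m))


def run_request(viable: List[Config], load_snapshots: List[Dict[str, int]]) -> Config:
    """Walk the stages in order, consulting a fresh load snapshot at each one."""
    prefix: Config = ()
    for load in load_snapshots:
        prefix += (schedule_stage(viable, prefix, load),)
    return prefix


# Made-up accuracy estimates for a two-stage workflow with two models.
accuracy = {
    ("large", "large"): 0.90,
    ("large", "small"): 0.89,
    ("small", "large"): 0.88,
    ("small", "small"): 0.70,
}
viable = route_once(accuracy, tolerance=0.03)

# Loads (queue lengths) observed when each stage is about to start.
chosen_a = run_request(viable, [{"large": 5, "small": 1}, {"large": 9, "small": 0}])
chosen_b = run_request(viable, [{"large": 0, "small": 4}, {"large": 7, "small": 2}])
print(chosen_a, chosen_b)
```

In this toy setup the first request ends up with `("small", "large")`: the small model is cheaper at stage one, but the second stage must use the large model even though it is busy, because `("small", "small")` was pruned by the one-time routing step. The second request ends up with `("large", "small")`, since load shifted between stages. The point of the split is that the expensive accuracy reasoning happens once, while each per-stage decision is a cheap lookup against current load.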