Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Fixed configurations chosen before a request runs yield suboptimal throughput and latency when agentic workflows are served under fluctuating load. To address this, the paper proposes Aragog, a system that adapts each request's model configuration stage by stage at runtime. Aragog decouples configuration selection into two complementary components: (i) a one-time routing step that identifies all accuracy-preserving configurations, and (ii) a lightweight per-stage scheduler that selects among them during request execution using up-to-date system observations. To keep both components efficient, it combines accuracy-preserving configuration pruning, real-time system-state awareness, and multi-stage workflow modeling. Across diverse workflows and model families, Aragog achieves 50.0%–217.0% higher maximum serving throughput and reduces median latency by 32.5%–78.9% at peak request rates, while maintaining accuracy comparable to the most expensive configurations.

📝 Abstract

Agentic workflows have emerged as a powerful paradigm for solving complex, multi-stage tasks, but serving them at scale is computationally expensive given the many LLM inferences that each request must pass through. Configuration selection, or the cost-aware assignment of workflow agents to specific LLMs, can reduce these costs, but existing approaches bind configuration decisions before request execution, making them ill-suited for the heterogeneous and lengthy execution of workflows. Specifically, system loads can fluctuate rapidly and substantially during a request's lifetime, causing fixed configurations to quickly become suboptimal. We present Aragog, a system that progressively adapts a request's configuration throughout its execution to match runtime dynamics. To make this practical despite the massive space of workflow configurations, Aragog decouples the problem into two core elements: a one-time routing step that identifies all accuracy-preserving configurations, and a cheap per-stage scheduler that selects among them using up-to-date system observations. It also introduces novel strategies to accelerate each. Across diverse workflows and model families, Aragog increases maximum serving throughput by 50.0–217.0% and reduces median latency by 32.5–78.9% at peak request rates, while maintaining accuracy comparable to the most expensive configurations.
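
The two-part decoupling described in the abstract can be illustrated with a minimal Python sketch. This is not Aragog's actual algorithm: the stage names, model names, toy accuracy estimate (a product of assumed per-model quality scores), and the use of queue length as the load signal are all hypothetical choices made only to show the shape of the idea. A one-time step filters the configuration space down to accuracy-preserving candidates, and a cheap per-stage step picks the next model from those candidates using the most recent load reading.

```python
from itertools import product

# Hypothetical inputs: every name and number here is an illustrative assumption.
STAGES = ["plan", "solve", "verify"]
MODELS = ["small", "large"]
QUALITY = {"small": 0.90, "large": 0.99}  # assumed per-stage quality of each model


def estimate_accuracy(config):
    """Toy accuracy estimate: product of the per-stage model quality scores."""
    acc = 1.0
    for model in config:
        acc *= QUALITY[model]
    return acc


def route_once(threshold):
    """One-time step: keep every configuration whose estimated accuracy
    meets the threshold (one model per stage, in stage order)."""
    return [c for c in product(MODELS, repeat=len(STAGES))
            if estimate_accuracy(c) >= threshold]


def schedule_stage(feasible, prefix, queue_len):
    """Per-stage step: among feasible configurations consistent with the
    models already used (prefix), pick the next model with the shortest queue."""
    stage = len(prefix)
    candidates = {c[stage] for c in feasible if c[:stage] == tuple(prefix)}
    # sorted() makes tie-breaking deterministic.
    return min(sorted(candidates), key=lambda m: queue_len[m])


def run_request(feasible, load_trace):
    """Run one request stage by stage, re-reading the load before each stage."""
    chosen = []
    for queue_len in load_trace:
        chosen.append(schedule_stage(feasible, chosen, queue_len))
    return tuple(chosen)


if __name__ == "__main__":
    feasible = route_once(threshold=0.88)
    # Load shifts between stages, so the choice is made just in time.
    trace = [{"small": 9, "large": 0},
             {"small": 0, "large": 9},
             {"small": 0, "large": 9}]
    print(run_request(feasible, trace))
```

In this toy setup, the scheduler can take the less loaded model at a stage only when doing so still leaves the request on some accuracy-preserving configuration. Once the cheaper model has been used at one stage, the remaining stages are constrained to the larger model.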

Problem

Research questions and friction points this paper is trying to address.

Optimizing computational costs in scalable agentic workflow serving

Addressing suboptimal fixed configurations during fluctuating system loads

Reducing latency while maintaining accuracy in multi-stage LLM workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

Just-in-time model routing for agentic workflows

Decouples configuration selection into a one-time accuracy-preserving routing step and a cheap per-stage scheduler

Adapts configurations dynamically using runtime system observations

🔎 Similar Papers

ALTO: An Efficient Network Orchestrator for Compound AI Systems