🤖 AI Summary
Existing LLM agent scheduling systems (e.g., vLLM) operate at inference-token granularity, making them ill-suited for agent-centric multi-stage workflows, which alternate between local computation and external API calls, and resulting in suboptimal end-to-end Job Completion Time (JCT). This paper proposes a state-aware hierarchical scheduling framework to address this limitation. First, we introduce a novel state modeling mechanism that jointly incorporates request history and behavioral prediction. Second, we design an I/O- and compute-aware enhanced Highest Response Ratio Next (HRRN) scheduling policy. Third, we develop an adaptive KV cache management scheme that preserves state consistency during I/O wait periods. Evaluated under realistic agent workloads, our framework reduces average JCT by up to 25.5% and demonstrates strong robustness and stability across diverse model scales and high-load scenarios.
📝 Abstract
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services such as Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O- or compute-intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles agent state during I/O waits based on system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
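The enhanced HRRN policy mentioned above builds on the classic Highest Response Ratio Next rule, which prioritizes requests by the ratio of (waiting time + service time) to service time, favoring short jobs while preventing starvation of long ones. A minimal sketch of the baseline rule follows; the `Request` fields, estimated service times, and queue contents are illustrative assumptions, not Astraea's actual implementation (which additionally accounts for I/O versus compute intensity):

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    wait_time: float    # time this request has spent queued so far
    est_service: float  # estimated remaining service time (hypothetical estimate)

def response_ratio(req: Request) -> float:
    # Classic HRRN: R = (waiting time + service time) / service time.
    return (req.wait_time + req.est_service) / req.est_service

def pick_next(queue: list[Request]) -> Request:
    # Short requests start with a high ratio (good for JCT), while
    # long-waiting requests see their ratio grow, preventing starvation.
    return max(queue, key=response_ratio)

queue = [
    Request("long-compute", wait_time=2.0, est_service=10.0),   # R = 1.2
    Request("short-io",     wait_time=2.0, est_service=1.0),    # R = 3.0
    Request("starving",     wait_time=30.0, est_service=10.0),  # R = 4.0
]
print(pick_next(queue).name)  # "starving" wins despite its long service time
```

Note how the short request beats the long one at equal wait times, but a sufficiently long wait eventually dominates; Astraea's variant layers request-state awareness and I/O/compute classification on top of this base ratio.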