🤖 AI Summary
Existing LLM agent scheduling systems (e.g., vLLM) operate at inference-token granularity, making them ill-suited for agent-centric multi-stage workflows, which alternate between local computation and external API calls, and resulting in suboptimal end-to-end Job Completion Time (JCT). This paper proposes a state-aware hierarchical scheduling framework to address this limitation. First, we introduce a novel state modeling mechanism that jointly incorporates request history and behavioral prediction. Second, we design an I/O- and compute-aware enhanced Highest Response Ratio Next (HRRN) scheduling policy. Third, we develop an adaptive KV cache management scheme that preserves state consistency during I/O wait periods. Evaluated under realistic agent workloads, our framework reduces average JCT by up to 25.5% and demonstrates strong robustness and stability across diverse model scales and high-load scenarios.
📝 Abstract
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services such as Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O- or compute-intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles agent state during I/O waits based on system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
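The enhanced HRRN policy mentioned above builds on the classic Highest Response Ratio Next rule, which prioritizes requests by the ratio of (waiting time + service time) to service time, favoring short jobs while preventing starvation of long ones. A minimal sketch of the baseline rule follows; the `Request` fields, estimated service times, and queue contents are illustrative assumptions, not Astraea's actual implementation (which additionally accounts for I/O versus compute intensity):

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    wait_time: float    # time this request has spent queued so far
    est_service: float  # estimated remaining service time (hypothetical estimate)

def response_ratio(req: Request) -> float:
    # Classic HRRN: R = (waiting time + service time) / service time.
    return (req.wait_time + req.est_service) / req.est_service

def pick_next(queue: list[Request]) -> Request:
    # Short requests start with a high ratio (good for JCT), while
    # long-waiting requests see their ratio grow, preventing starvation.
    return max(queue, key=response_ratio)

queue = [
    Request("long-compute", wait_time=2.0, est_service=10.0),   # R = 1.2
    Request("short-io",     wait_time=2.0, est_service=1.0),    # R = 3.0
    Request("starving",     wait_time=30.0, est_service=10.0),  # R = 4.0
]
print(pick_next(queue).name)  # "starving" wins despite its long service time
```

Note how the short request beats the long one at equal wait times, but a sufficiently long wait eventually dominates; Astraea's variant layers request-state awareness and I/O/compute classification on top of this base ratio.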