🤖 AI Summary
In agent workflow services, computational and memory interference across stages leads to low KV cache utilization, constrained throughput, and unpredictable performance. To address this, we propose Cortex—the first workflow-aware, isolation-oriented service architecture for agent workloads. Its core contributions are: (1) a stage-aware resource pooling and isolation mechanism that allocates dedicated CPU and memory resources to each workflow stage; (2) an elastic “agent-state” cache supporting speculative branch execution, dynamic scheduling, and multi-level sharing; and (3) tight co-optimization of workflow-aware scheduling and KV caching. Evaluation shows Cortex improves throughput by up to 2.3×, increases KV cache hit rate by 41%, and significantly reduces tail latency variability—delivering both high efficiency and strong performance determinism for complex agent applications.
📝 Abstract
We introduce Cortex, a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions dedicated resource pools for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each distinct stage of agentic workflows, Cortex lays the groundwork for more advanced, agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for "agentic state."