🤖 AI Summary
Shared large language model (LLM) serving for multi-agent applications suffers from high load and long latency because requests from many agents interleave at the same LLM. To address this, the paper proposes Kairos, an end-to-end low-latency collaborative serving system with two key innovations: (1) a workflow-aware priority scheduler that jointly models inter-task dependency latency and heterogeneous resource disparities to rank requests at the workflow level; and (2) a memory-aware dispatcher that dynamically routes requests based on the real-time GPU memory state of each LLM instance. The system integrates a workflow orchestrator, a priority scheduler, and a memory-aware dispatcher to support runtime dependency analysis and coordinated resource allocation. Extensive experiments on public cloud infrastructure show that Kairos reduces end-to-end latency by 17.8%–28.4% compared to state-of-the-art methods, significantly improving the real-time serving performance of multi-agent systems.
📝 Abstract
Multi-agent applications harness the advanced capabilities of large language models (LLMs) to complete intricate tasks through agent collaboration in a workflow. In this setting, requests from different agents typically access the same shared LLM to perform different kinds of tasks, subjecting the shared LLM to excessive load. Existing works serve these multi-agent applications poorly, mainly because they ignore inter-agent latency and resource differences when scheduling requests. We therefore propose Kairos, a multi-agent orchestration system that optimizes end-to-end latency for multi-agent applications. Kairos consists of a workflow orchestrator, a workflow-aware priority scheduler, and a memory-aware dispatcher. The orchestrator collects agent-specific information for online workflow analysis. The scheduler decides the serving priority of requests based on their latency characteristics to reduce overall queuing delay. The dispatcher routes requests to different LLM instances based on their memory demands to avoid GPU overloading. Experimental results show that Kairos reduces end-to-end latency by 17.8% to 28.4% compared to state-of-the-art works.
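The scheduler-plus-dispatcher idea can be sketched in a few lines. This is a minimal illustration, not Kairos itself: the priority values, agent names, memory figures, and the "most free memory that still fits the request" dispatch rule are all assumptions for the example; the paper's actual priority model (inter-task dependency latency and resource disparities) is more sophisticated.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float  # lower = served first (only field used for ordering)
    agent: str = field(compare=False)
    mem_demand_mb: int = field(compare=False)  # e.g. KV-cache footprint

def pick_instance(instances, req):
    """Memory-aware dispatch (hypothetical rule): among instances whose
    free GPU memory fits the request, pick the one with the most headroom."""
    feasible = [i for i in instances if i["free_mb"] >= req.mem_demand_mb]
    if not feasible:
        return None  # every instance would be overloaded; request waits
    return max(feasible, key=lambda i: i["free_mb"])

# Hypothetical workflow: planner -> (coder, tester); requests closer to
# the workflow's critical path get smaller (more urgent) priority values.
queue = []
heapq.heappush(queue, Request(0.0, "planner", 800))
heapq.heappush(queue, Request(2.5, "tester", 300))
heapq.heappush(queue, Request(1.0, "coder", 500))

instances = [{"name": "llm-0", "free_mb": 600},
             {"name": "llm-1", "free_mb": 1000}]

order = []
while queue:
    req = heapq.heappop(queue)      # serve in workflow-priority order
    inst = pick_instance(instances, req)
    if inst is None:
        continue                    # deferred until memory frees up
    inst["free_mb"] -= req.mem_demand_mb
    order.append((req.agent, inst["name"]))
```

Here the planner's request is served first and lands on the roomier instance, while the tester's request is deferred once both instances run low on memory, which is the overload-avoidance behavior the dispatcher is meant to provide.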