ALTO: An Efficient Network Orchestrator for Compound AI Systems

📅 2024-03-07
🏛️ EuroMLSys@EuroSys
📈 Citations: 9
Influential: 0
🤖 AI Summary
Compound AI systems, such as LLM pipelines, combine heterogeneous components, data of variable granularity, and dynamic aggregation of intermediate results. These properties make parallelism and pipelining hard to exploit, which leads to low service throughput and high tail latency. This paper presents ALTO, a network orchestrator tailored to compound AI workloads. Its two core ideas are (1) an aggregation-aware routing interface, motivated by stages that are stateful across partial outputs, and (2) distributed prompt-aware scheduling, motivated by language models producing variable amounts of text. On a complex chatbot verification pipeline with a fixed latency target of 4 seconds per request, ALTO's partial output streaming improves throughput by up to 3× and reduces tail latency by 1.8× compared to a baseline serving approach.

📝 Abstract
We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO leverages an optimization opportunity specific to generative language models, which is streaming intermediate outputs from the language model to downstream stages. We highlight two challenges that emerge while serving these applications at scale: handling how some stages can be stateful across partial outputs, and handling how language models can produce variable amounts of text. To address these challenges, we motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling. ALTO's partial output streaming increases throughput by up to 3× for a fixed latency target of 4 seconds per request and reduces tail latency by 1.8× compared to a baseline serving approach, on a complex chatbot verification pipeline.
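
The first challenge in the abstract, stages that are stateful across partial outputs, can be illustrated with a minimal sketch. It is not ALTO's actual interface: the names `route`, `StatefulStage`, and `stream_chunks`, and the choice of hashing a request ID as the aggregation key, are all assumptions made for illustration. The sketch only shows why every partial output of one request has to reach the same downstream instance when that instance accumulates state.

```python
import hashlib
from collections import defaultdict


def route(aggregation_key: str, num_instances: int) -> int:
    """Map an aggregation key to a downstream instance deterministically.

    Hypothetical helper: hashing the key is one simple way to keep all
    partial outputs that share a key on the same instance.
    """
    digest = hashlib.sha256(aggregation_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


class StatefulStage:
    """Hypothetical downstream stage that accumulates chunks per request."""

    def __init__(self) -> None:
        self.buffers = defaultdict(list)

    def on_chunk(self, request_id: str, chunk: str) -> None:
        # State is kept across partial outputs of the same request.
        self.buffers[request_id].append(chunk)

    def result(self, request_id: str) -> str:
        return "".join(self.buffers[request_id])


def stream_chunks(text: str, size: int):
    """Yield fixed-size pieces of text, standing in for streamed LM output."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


# Four replicas of the stateful stage; outputs have different lengths.
instances = [StatefulStage() for _ in range(4)]
requests = {
    "req-1": "The capital of France is Paris.",
    "req-2": "Water boils at 100 degrees Celsius at sea level.",
}

for request_id, text in requests.items():
    for chunk in stream_chunks(text, 5):
        target = instances[route(request_id, len(instances))]
        target.on_chunk(request_id, chunk)
```

Because the instance is chosen from the request ID alone, each replica sees either all chunks of a request or none of them. Routing each chunk independently (for example, round-robin) would scatter one request's state across replicas.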

Problem

Research questions and friction points this paper is trying to address.

Optimizing parallelism in compound AI systems with diverse component constraints
Managing intermediate data fragmentation in distributed AI workflows
Automating complex dataflow routing for heterogeneous AI components

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated optimization via streaming and parallelism
Nested ancestry metadata for data tracking
Inferred metadata for complex dataflow patterns