🤖 AI Summary
Compound AI systems—such as LLM pipelines—combine heterogeneous components, variable-granularity data, and dynamic aggregation of intermediate results, which makes it hard to exploit parallelism and pipelining and leads to low service throughput and high tail latency. ALTO is a network orchestrator tailored to these workloads. Its core ideas are: (1) an aggregation-aware routing interface that distributes variable-length text streams while keeping semantically related partial outputs together; and (2) a distributed prompt-aware scheduling mechanism that provides state-sensitive load balancing alongside partial-output streaming. On a complex chatbot verification pipeline with a 4-second-per-request latency target, ALTO improves throughput by up to 3× and reduces tail latency by 1.8× compared to a baseline serving approach.
📝 Abstract
We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO leverages an optimization opportunity specific to generative language models: streaming intermediate outputs from the language model to downstream stages. We highlight two challenges that emerge while serving these applications at scale: handling stages that are stateful across partial outputs, and handling language models that produce variable amounts of text. To address these challenges, we motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling. ALTO's partial output streaming increases throughput by up to 3× for a fixed latency target of 4 seconds per request and reduces tail latency by 1.8× compared to a baseline serving approach, on a complex chatbot verification pipeline.
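The core optimization above — forwarding partial LLM outputs to downstream stages as they are generated, while letting stateful consumers aggregate the pieces of each request — can be sketched in miniature as follows. This is an illustrative sketch only: `fake_llm`, the queue wiring, and the end-of-stream markers are assumptions for the example, not ALTO's actual interface.

```python
import queue
import threading

def fake_llm(prompt):
    """Stand-in for a streaming language model: yields tokens one at a time."""
    for token in f"verified answer for: {prompt}".split():
        yield token

def llm_stage(prompts, out_q):
    # Push each partial output downstream as soon as it is produced,
    # tagged with a request id so a stateful consumer can aggregate it.
    for req_id, prompt in enumerate(prompts):
        for token in fake_llm(prompt):
            out_q.put((req_id, token))
        out_q.put((req_id, None))  # end-of-stream marker for this request
    out_q.put(None)                # whole pipeline done

def downstream_stage(in_q):
    # Stateful consumer: keeps a per-request buffer of partial outputs.
    buffers, results = {}, {}
    while True:
        item = in_q.get()
        if item is None:
            break
        req_id, token = item
        if token is None:
            results[req_id] = " ".join(buffers.pop(req_id))
        else:
            # Downstream work can begin here, before the full output exists.
            buffers.setdefault(req_id, []).append(token)
    return results

q = queue.Queue()
producer = threading.Thread(target=llm_stage, args=(["q1", "q2"], q))
producer.start()
results = downstream_stage(q)
producer.join()
```

Because the downstream stage starts consuming tokens before any request finishes, stage latencies overlap instead of adding up; the per-request tagging stands in for the aggregation-aware routing that keeps a stateful stage's partial inputs consistent.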