🤖 AI Summary
Compound AI systems—such as LLM pipelines—combine heterogeneous components, variable-granularity data, and dynamic aggregation of intermediate results, which makes it hard to exploit parallelism and pipelining and leads to low service throughput and high tail latency. ALTO is a network orchestrator tailored to these workloads. Its core ideas are: (1) an aggregation-aware routing interface that distributes variable-length text streams while keeping semantically related partial outputs together; and (2) a distributed prompt-aware scheduling mechanism that provides state-sensitive load balancing alongside partial-output streaming. On a complex chatbot verification pipeline with a 4-second-per-request latency target, ALTO improves throughput by up to 3× and reduces tail latency by 1.8× compared to a baseline serving approach.
📝 Abstract
We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO leverages an optimization opportunity specific to generative language models: streaming intermediate outputs from the language model to downstream stages. We highlight two challenges that emerge while serving these applications at scale: handling stages that are stateful across partial outputs, and handling language models that produce variable amounts of text. To address these challenges, we motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling. ALTO's partial output streaming increases throughput by up to 3× for a fixed latency target of 4 seconds per request and reduces tail latency by 1.8× compared to a baseline serving approach, on a complex chatbot verification pipeline.
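The core optimization above — forwarding partial LLM outputs to downstream stages as they are generated, while letting stateful consumers aggregate the pieces of each request — can be sketched in miniature as follows. This is an illustrative sketch only: `fake_llm`, the queue wiring, and the end-of-stream markers are assumptions for the example, not ALTO's actual interface.

```python
import queue
import threading

def fake_llm(prompt):
    """Stand-in for a streaming language model: yields tokens one at a time."""
    for token in f"verified answer for: {prompt}".split():
        yield token

def llm_stage(prompts, out_q):
    # Push each partial output downstream as soon as it is produced,
    # tagged with a request id so a stateful consumer can aggregate it.
    for req_id, prompt in enumerate(prompts):
        for token in fake_llm(prompt):
            out_q.put((req_id, token))
        out_q.put((req_id, None))  # end-of-stream marker for this request
    out_q.put(None)                # whole pipeline done

def downstream_stage(in_q):
    # Stateful consumer: keeps a per-request buffer of partial outputs.
    buffers, results = {}, {}
    while True:
        item = in_q.get()
        if item is None:
            break
        req_id, token = item
        if token is None:
            results[req_id] = " ".join(buffers.pop(req_id))
        else:
            # Downstream work can begin here, before the full output exists.
            buffers.setdefault(req_id, []).append(token)
    return results

q = queue.Queue()
producer = threading.Thread(target=llm_stage, args=(["q1", "q2"], q))
producer.start()
results = downstream_stage(q)
producer.join()
```

Because the downstream stage starts consuming tokens before any request finishes, stage latencies overlap instead of adding up; the per-request tagging stands in for the aggregation-aware routing that keeps a stateful stage's partial inputs consistent.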