🤖 AI Summary
In multi-node large language model (LLM) inference, achieving both high throughput and low interactive latency remains challenging. This paper proposes a systematic disaggregated deployment framework for the inference phase, separating the prefill and decoding stages while enabling dynamic coordination between them. For the first time, we conduct end-to-end modeling and an exhaustive search across hundreds of thousands of design points on real hardware under diverse workloads. Our analysis reveals that disaggregation yields its largest gains under prefill-intensive traffic, which is common in LLM serving, and identifies dynamic rate matching and elastic scaling as critical enablers. The proposed fine-grained stage coordination and resource scheduling strategy achieves 2.3× higher throughput and 40% lower first-token latency on typical services, approaching the theoretical Pareto-optimal frontier. The study delivers actionable, production-ready deployment guidelines for scalable LLM inference systems.
📝 Abstract
As inference scales to multi-node deployments, disaggregation (splitting inference into distinct phases) offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and of system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for deploying disaggregated serving efficiently and for navigating the trade-off between system throughput and interactivity.
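The "rate matching" idea the abstract highlights can be sketched as a resource-allocation calculation: split a fixed GPU pool between prefill and decode workers so that neither stage bottlenecks the other. The following is a minimal, hypothetical sketch, not the paper's actual scheduler; the function name, throughput numbers, and workload parameters are all illustrative assumptions.

```python
# Hypothetical sketch of static rate matching between disaggregated stages.
# Throughputs and token counts below are illustrative, not from the paper.

def match_rates(prefill_tps, decode_tps, prefill_tokens, decode_tokens, total_gpus):
    """Split total_gpus between prefill and decode workers so the per-request
    GPU-time demand of the two stages is approximately balanced.

    prefill_tps / decode_tps: per-GPU throughput of each stage (tokens/s).
    prefill_tokens / decode_tokens: average tokens per request in each stage
    (prompt length vs. generated length).
    """
    # GPU-seconds each stage needs to serve one average request.
    prefill_demand = prefill_tokens / prefill_tps
    decode_demand = decode_tokens / decode_tps
    # Allocate GPUs proportionally to demand, keeping at least one per stage.
    prefill_gpus = max(1, round(total_gpus * prefill_demand / (prefill_demand + decode_demand)))
    prefill_gpus = min(prefill_gpus, total_gpus - 1)
    return prefill_gpus, total_gpus - prefill_gpus

# Prefill-heavy traffic (long prompts, short generations) pulls the
# allocation toward prefill workers, matching the paper's finding that
# disaggregation helps most under prefill-intensive workloads.
print(match_rates(prefill_tps=40_000, decode_tps=2_000,
                  prefill_tokens=8192, decode_tokens=256,
                  total_gpus=8))  # → (5, 3)
```

A dynamic version of this idea would recompute the split online as the observed prompt/generation length mix drifts, which is where the elastic scaling the authors identify comes in.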