Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

📅 2025-06-05
🤖 AI Summary
In multi-node large language model (LLM) inference, achieving both high throughput and low interactive latency remains challenging. This paper proposes a systematic disaggregated deployment framework for the inference phase, separating the prefill and decode stages while enabling dynamic coordination between them. For the first time, we conduct end-to-end modeling and exhaustive search across hundreds of thousands of design points on real hardware under diverse workloads. Our analysis reveals that disaggregation yields its largest gains under prefill-heavy traffic, which is common in LLM serving, and identifies dynamic rate matching and elastic scaling as critical enablers. The proposed fine-grained stage coordination and resource scheduling strategy achieves 2.3× higher throughput and 40% lower first-token latency on typical services, approaching the theoretical Pareto-optimal frontier. The study delivers actionable, production-ready deployment guidelines for scalable LLM inference systems.
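The prefill/decode split described in the summary can be sketched as two worker pools connected by a handoff queue: a prefill worker runs the compute-bound pass over the whole prompt once, then hands its state to a decode worker that generates tokens one step at a time. This is a minimal illustrative sketch, not the paper's implementation; the request fields and the toy "KV cache" arithmetic are assumptions for demonstration only.

```python
import queue
import threading

def prefill_worker(requests, handoff):
    """Run the one-shot, compute-bound prefill pass per request.

    Pulls requests until a None sentinel arrives, then forwards the
    sentinel so the downstream decode worker also shuts down.
    """
    while True:
        req = requests.get()
        if req is None:
            handoff.put(None)
            return
        # Stand-in for building the KV cache over all prompt tokens.
        kv_cache = [tok * 2 for tok in req["prompt"]]
        handoff.put({"id": req["id"], "kv": kv_cache, "max_new": req["max_new"]})

def decode_worker(handoff, results):
    """Generate tokens autoregressively from a handed-off KV cache."""
    while True:
        job = handoff.get()
        if job is None:
            return
        out = []
        for _ in range(job["max_new"]):
            out.append(job["kv"][-1] + 1)  # stand-in for one decode step
            job["kv"].append(out[-1])
        results[job["id"]] = out
```

Running each pool on separate nodes (rather than separate threads, as in this toy version) is what lets prefill and decode be scaled independently, which is the premise behind the rate matching and elastic scaling the summary highlights.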

📝 Abstract
As inference scales to multi-node deployments, disaggregation (splitting inference into distinct phases) offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.
Problem

Research questions and friction points this paper is trying to address.

Optimizing inference disaggregation for throughput-interactivity trade-offs
Addressing complexity in deployment due to optimization search space
Evaluating disaggregation effectiveness across workloads and hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic study of disaggregated inference at scale
Dynamic rate matching for Pareto-optimal performance
Elastic scaling in prefill-heavy traffic patterns
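The rate-matching idea above amounts to sizing the two worker pools so that neither stage becomes the bottleneck: aggregate prefill throughput must keep up with incoming prompt tokens, and aggregate decode throughput with generated tokens. A minimal sketch follows; the function name and all per-worker throughput figures are illustrative assumptions, not measurements from the paper.

```python
from math import ceil

def match_rates(request_rate, avg_prompt_tokens, avg_output_tokens,
                prefill_tok_per_s, decode_tok_per_s):
    """Pick prefill/decode worker counts so neither stage bottlenecks.

    request_rate: incoming requests per second
    prefill_tok_per_s / decode_tok_per_s: assumed per-worker throughputs
    Returns (prefill_workers, decode_workers).
    """
    prefill_demand = request_rate * avg_prompt_tokens   # prompt tokens/s
    decode_demand = request_rate * avg_output_tokens    # generated tokens/s
    n_prefill = ceil(prefill_demand / prefill_tok_per_s)
    n_decode = ceil(decode_demand / decode_tok_per_s)
    return n_prefill, n_decode

# Prefill-heavy example (long prompts, short outputs, assumed throughputs):
# match_rates(100, 4000, 200, 50000, 5000) -> (8, 4)
```

With prefill-heavy traffic the prompt-token demand dominates, so the prefill pool must be larger, which is consistent with the finding that disaggregation helps most in that regime; re-evaluating these counts as traffic shifts is the "dynamic" part of rate matching, and adding or removing workers to follow it is the elastic scaling.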
Authors

Tiyasa Mitra, NVIDIA Corporation
Ritika Borkar, NVIDIA Corporation
Nidhi Bhatia, NVIDIA Corporation
Ramon Matas, Principal Engineer, NVIDIA (Computer Architecture, ML, HPC)
Shivam Raj, NVIDIA Corporation
Dheevatsa Mudigere, Distinguished Engineer, NVIDIA (Scientific Computing, Deep Learning, Applied Numerical Methods, High Performance Computing, CFD)
Ritchie Zhao, NVIDIA (Computer Science, Computer Architecture)
Maximilian Golub, Data & Applied Scientist, Microsoft (Machine Learning)
Arpan Dutta, NVIDIA Corporation
Sailaja Madduri, NVIDIA Corporation
Dharmesh Jani, NVIDIA Corporation
Brian Pharris, NVIDIA Corporation
Bita Darvish Rouhani, Distinguished Engineer, NVIDIA (Generative AI, AI Supercomputing, Systems for AI)