A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the producer-consumer imbalance arising from resource decoupling between the prefill and decode phases of LLM inference, this paper proposes a dynamic Prefill-Decode (PD) disaggregation architecture. The method integrates real-time load monitoring, dynamic reconfiguration, and phase-separated deployment. Its core contributions are: (1) a history-aware, adaptive mechanism for tuning the ratio of prefill-to-decode instances, enabling fine-grained, elastic resource scaling; and (2) a co-designed request scheduling policy that dynamically matches heterogeneous (i.e., mixed-length) requests to available prefill and decode resources. Experimental evaluation demonstrates that, compared to vLLM and DistServe, the system achieves up to 1.5× higher goodput, reduces P90 time-to-first-token by up to 67.5%, decreases P90 time-per-output-token by up to 22.8%, and attains an SLO compliance rate exceeding 99%.
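The summary above describes a history-aware controller that tunes the prefill-to-decode instance split from monitored load. The paper's actual algorithm is not reproduced here; as a rough illustration only, a proportional split over a sliding window of demand samples might look like the sketch below (the class name, the token-demand signal, and the averaging rule are all assumptions, not details from the paper):

```python
from collections import deque

class PDRatioController:
    """Illustrative controller that picks a prefill/decode instance split
    from recent load samples (hypothetical; not DOPD's actual algorithm)."""

    def __init__(self, total_instances, window=8):
        self.total = total_instances
        # Sliding window of (prefill_demand, decode_demand) samples.
        self.history = deque(maxlen=window)

    def observe(self, prefill_tokens, decode_tokens):
        # Record one monitoring-interval sample of work arriving at each phase.
        self.history.append((prefill_tokens, decode_tokens))

    def target_split(self):
        # Average demand over the history window (the "history-aware" part),
        # then allocate instances proportionally, keeping >= 1 of each type.
        p = sum(h[0] for h in self.history) / len(self.history)
        d = sum(h[1] for h in self.history) / len(self.history)
        prefill = max(1, min(self.total - 1, round(self.total * p / (p + d))))
        return prefill, self.total - prefill

ctrl = PDRatioController(total_instances=8)
ctrl.observe(prefill_tokens=6000, decode_tokens=2000)
ctrl.observe(prefill_tokens=5000, decode_tokens=3000)
print(ctrl.target_split())  # → (6, 2)
```

A real system would also rate-limit reconfigurations and account for the cost of migrating KV caches when an instance switches roles; none of that is modeled here.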

📝 Abstract
To meet strict Service-Level Objectives (SLOs), contemporary Large Language Model (LLM) serving systems decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes producer-consumer imbalance between the two instance types in such disaggregated architectures. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource-allocation mismatches caused by mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches), DOPD improves overall system goodput by up to 1.5×, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.
Problem

Research questions and friction points this paper is trying to address.

Addresses producer-consumer imbalance in disaggregated LLM inference architectures
Optimizes prefill-to-decoding ratio dynamically through real-time load monitoring
Resolves resource mismatches for mixed-length requests under high concurrency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic prefill-decoding ratio adjustment via real-time monitoring
Request scheduling policy resolves producer-consumer imbalance
Proactive reconfiguration using historical load data
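The scheduling side of the contributions above, matching mixed-length requests to prefill resources, could for illustration resemble a longest-first, least-loaded placement, which keeps long prompts from piling up behind one queue. This is a generic sketch under assumed names, not the paper's actual policy:

```python
import heapq

def assign_requests(prompt_lengths, n_prefill):
    """Toy longest-first, least-loaded placement of mixed-length requests
    onto prefill instances (illustrative; DOPD's policy may differ)."""
    # Each heap entry: (pending prompt tokens, instance id, assigned requests).
    heap = [(0, i, []) for i in range(n_prefill)]
    heapq.heapify(heap)
    # Placing the longest prompts first balances pending work across instances.
    for length in sorted(prompt_lengths, reverse=True):
        load, i, reqs = heapq.heappop(heap)
        reqs.append(length)
        heapq.heappush(heap, (load + length, i, reqs))
    return sorted(heap)

for load, i, reqs in assign_requests([4096, 128, 2048, 256, 1024], n_prefill=2):
    print(f"instance {i}: {load} pending tokens, requests {reqs}")
```

With the sample workload, one instance takes the single 4096-token prompt while the other absorbs the four shorter ones, so neither queue is dominated by long requests.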
Junhan Liao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Minxian Xu
Associate Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing · Microservices · LLM Inference
Wanyi Zheng
Southern University of Science and Technology, and also a joint-training student at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yan Wang
College of Computer Science, Inner Mongolia University, Inner Mongolia, China
Kejiang Ye
Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing · AI Systems · Industrial Internet
Rajkumar Buyya
School of Computing and Information Systems, The University of Melbourne; Fellow of IEEE & Academia Europaea
Cloud Computing · Data Centers · Edge Computing · Internet of Things · Quantum Computing
Chengzhong Xu
State Key Lab of IOTSC, University of Macau, Macau, China