Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM) inference on heterogeneous multi-vendor GPUs, such as NVIDIA GPUs alongside domestic accelerators, suffers from low efficiency, poor compatibility, and suboptimal resource utilization. Method: This paper proposes the first heterogeneous-GPU-aware disaggregated Prefill-Decode (P-D) inference framework, decoupling the prefill and decode phases. It introduces an interoperable data transfer module and a dynamic load-balancing mechanism, jointly optimizing tensor parallelism strategies and instance allocation to overcome cross-vendor data exchange and coordinated scheduling bottlenecks. Contribution/Results: The framework resolves incompatibilities in tensor formats and communication protocols across vendors, achieving significant throughput gains, over 30% higher GPU resource utilization, reduced deployment costs, and diminished vendor lock-in risk. Experimental evaluation validates its effectiveness and practicality in real-world heterogeneous GPU environments.
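To make the P-D decoupling concrete, here is a minimal sketch in Python. All names and the placeholder forward passes are illustrative assumptions, not the paper's implementation; it only shows how prefill and decode can run as separate instances handing off a KV cache.

```python
# A minimal sketch (not the paper's implementation) of the disaggregated
# Prefill-Decode flow: a prefill instance builds the KV cache for the
# prompt, the cache is handed to a separate decode instance, and decoding
# proceeds token by token. All names here are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: dict = field(default_factory=dict)   # layer -> (K, V) tensors
    output_tokens: list[int] = field(default_factory=list)


class PrefillInstance:
    """Compute-bound stage: one forward pass over the full prompt."""

    def run(self, req: Request) -> Request:
        # Placeholder for the real forward pass that fills the KV cache.
        req.kv_cache = {"layer_0": ("K", "V")}
        return req


class DecodeInstance:
    """Memory-bound stage: one token per step, reusing the KV cache."""

    def run(self, req: Request) -> Request:
        for step in range(req.max_new_tokens):
            req.output_tokens.append(step)  # placeholder next-token id
        return req


def serve(req: Request, prefill: PrefillInstance, decode: DecodeInstance) -> Request:
    # Separating the stages lets each run on hardware (and with a
    # parallelism strategy) suited to its compute/memory profile.
    return decode.run(prefill.run(req))
```

Because the two stages never share a device, each side can be placed on the vendor and tensor-parallel degree that fits its profile, which is the property the framework exploits.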

📝 Abstract
LLM-based applications are widely used across industries, but as model sizes grow, an efficient large language model (LLM) inference system has become an urgent problem for service providers. Inference is divided into two stages with different characteristics, Prefill and Decode, and the two stages interfere with each other when co-located. To address this, some researchers have proposed P-D disaggregated inference frameworks. However, current research targets homogeneous GPUs and lacks deployment solutions driven by business scenarios. Compared with homogeneous GPUs, building inference systems on heterogeneous GPUs can better improve resource utilization and reduce costs. Even when GPUs from different vendors are used, beyond reducing costs, resource utilization can be improved and dependence on a single vendor can be reduced. Therefore, a P-D disaggregated inference system based on heterogeneous GPUs is designed, including a heterogeneity-compatible transmission module that addresses data compatibility issues across heterogeneous GPUs. Then, a joint optimization algorithm over the parallelism strategy and the number of instances is proposed to obtain deployment solutions. Finally, experimental results show that the P-D disaggregated inference system solves the hybrid inference problem across heterogeneous GPUs from different vendors, and that the joint optimization algorithm obtains the optimal deployment solution.
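The abstract's joint optimization of parallelism strategy and instance counts can be read as a search over deployment configurations. The toy sketch below assumes a hypothetical throughput model and a brute-force search; the paper's actual cost model and algorithm are not reproduced here.

```python
# A toy sketch of the joint optimization idea: enumerate tensor-parallel
# (TP) degrees and prefill/decode instance counts for each vendor's GPU
# pool, and keep the configuration with the best modeled throughput under
# the GPU budget. The throughput model is a stand-in, not the paper's.
from itertools import product


def modeled_throughput(tp_prefill: int, tp_decode: int,
                       n_prefill: int, n_decode: int) -> float:
    # Hypothetical stand-in: prefill rate scales with compute, decode rate
    # mostly with instance count; min() models the pipeline bottleneck.
    prefill_rate = n_prefill * tp_prefill * 1.0
    decode_rate = n_decode * (1.0 + 0.5 * (tp_decode - 1))
    return min(prefill_rate, decode_rate)


def search(gpus_vendor_a: int, gpus_vendor_b: int):
    # Vendor A hosts prefill instances, vendor B hosts decode instances.
    best, best_cfg = 0.0, None
    for tp_p, tp_d in product((1, 2, 4), repeat=2):
        for n_p in range(1, gpus_vendor_a // tp_p + 1):
            for n_d in range(1, gpus_vendor_b // tp_d + 1):
                t = modeled_throughput(tp_p, tp_d, n_p, n_d)
                if t > best:
                    best, best_cfg = t, (tp_p, tp_d, n_p, n_d)
    return best_cfg, best


if __name__ == "__main__":
    cfg, tput = search(gpus_vendor_a=8, gpus_vendor_b=8)
    print(f"(tp_prefill, tp_decode, n_prefill, n_decode) = {cfg}, "
          f"modeled throughput = {tput}")
```

In practice the search space would be pruned and the throughput model calibrated from profiling, but the shape of the decision, a TP degree per stage plus instance counts per GPU pool, is the same.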
Problem

Research questions and friction points this paper is trying to address.

Solving interference between Prefill and Decode stages in LLM inference
Addressing heterogeneous GPU data compatibility in multi-vendor systems
Optimizing resource utilization and reducing costs for LLM serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated prefill and decoding stages for LLM inference
Heterogeneous multi-vendor GPU compatibility transmission module (see the sketch after this list)
Joint optimization algorithm for parallel strategy and allocation
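As referenced in the list above, one way a compatibility transmission module could bridge vendors is to serialize tensors into a vendor-neutral wire format before transfer, so that neither side depends on the other's memory layout or communication library. The sketch below assumes a simple fp16-plus-shape-header format; the paper's actual module and protocol are not public here.

```python
# A hedged sketch of format normalization for cross-vendor KV-cache
# transfer: pack each tensor as contiguous little-endian float16 bytes
# preceded by a shape header. The receiver reconstructs the array and
# re-uploads it to its own device with its own vendor stack.
import struct

import numpy as np


def pack_tensor(t: np.ndarray) -> bytes:
    """Serialize a tensor as: ndim, dims..., then the raw fp16 payload."""
    t = np.ascontiguousarray(t, dtype=np.dtype("<f2"))  # little-endian fp16
    header = struct.pack("<I", t.ndim) + struct.pack(f"<{t.ndim}I", *t.shape)
    return header + t.tobytes()


def unpack_tensor(buf: bytes) -> np.ndarray:
    """Inverse of pack_tensor."""
    (ndim,) = struct.unpack_from("<I", buf, 0)
    shape = struct.unpack_from(f"<{ndim}I", buf, 4)
    payload = buf[4 + 4 * ndim:]
    return np.frombuffer(payload, dtype=np.dtype("<f2")).reshape(shape)


if __name__ == "__main__":
    kv = np.random.rand(2, 4, 8).astype(np.float16)  # toy KV-cache slice
    wire = pack_tensor(kv)
    assert np.array_equal(unpack_tensor(wire), kv)
```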
👥 Authors
Xing Chen
Wireless and Computing Product R&D Institute, ZTE Corporation, China
Rong Shi
Wireless and Computing Product R&D Institute, ZTE Corporation, China
Lu Zhao
Wireless and Computing Product R&D Institute, ZTE Corporation, China
Lingbin Wang
Wireless and Computing Product R&D Institute, ZTE Corporation, China
Xiao Jin
CUHK
Yueqiang Chen
Wireless and Computing Product R&D Institute, ZTE Corporation, China
Hongfeng Sun
Wireless and Computing Product R&D Institute, ZTE Corporation, China