🤖 AI Summary
In existing LLM serving systems, unified architectures suffer from resource contention and scheduling interference between the prefill and decode phases, while disaggregated architectures incur storage inefficiencies: weight redundancy, costly KV cache transfer across GPUs, imbalanced GPU memory utilization, and difficulty with dynamic resource adjustment. This paper proposes semi-PD, a novel serving architecture with disaggregated computation and unified storage. It combines an SM-granularity asynchronous compute controller with a unified GPU memory manager, enabling weight sharing across the two phases and eliminating KV cache transfer between them, and adds an SLO-aware dynamic resource partitioning algorithm. On DeepSeek models, semi-PD reduces average end-to-end latency per request by 1.27–2.58×; on Llama models, it serves 1.55–1.72× more requests within latency constraints.
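The summary describes the compute controller only at a high level. As a rough illustration of SM-level computation partitioning over shared storage, the sketch below launches prefill and decode workers whose SM shares are capped via NVIDIA MPS's documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE setting. The worker scripts, the 70/30 split, and the use of MPS itself are illustrative assumptions, not semi-PD's actual mechanism.

```python
import os
import subprocess

def launch_phase_worker(script: str, sm_percent: int) -> subprocess.Popen:
    """Launch a worker process restricted to a fraction of the GPU's SMs.

    Uses CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, an NVIDIA MPS knob that caps
    the portion of SMs an MPS client may occupy. This is a stand-in for
    semi-PD's own controller, which is not detailed in the summary.
    Assumes an MPS control daemon is already running on the GPU.
    """
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percent)
    return subprocess.Popen(["python", script], env=env)

# Hypothetical worker scripts: both map the same weights and KV cache
# ("unified storage"), but each phase's kernels run on a disjoint share
# of SMs ("disaggregated computation").
prefill = launch_phase_worker("prefill_worker.py", sm_percent=70)
decode = launch_phase_worker("decode_worker.py", sm_percent=30)
```

Because both workers address one copy of the model state, shifting the split only changes compute shares, which is what makes low-overhead repartitioning possible.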
📝 Abstract
Existing large language model (LLM) serving systems fall into two categories: 1) unified systems, where the prefill and decode phases are co-located on the same GPU and share its computational resources and storage, and 2) disaggregated systems, where the two phases are placed on different GPUs. The disaggregated design addresses the latency interference and complicated scheduling of the unified system, but introduces storage challenges: 1) weights replicated for both phases, which prevents flexible deployment; 2) KV cache transfer overhead between the two phases; 3) storage imbalance, which wastes substantial GPU memory capacity; and 4) suboptimal resource adjustment, owing to the difficulty of migrating KV cache. This storage inefficiency yields poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in its disaggregated computation, i.e., partitioning the computational resources so that the two phases compute asynchronously. We therefore propose semi-PD, a novel LLM serving system characterized by disaggregated computation and unified storage. In semi-PD, a computation resource controller achieves disaggregated computation at the streaming multiprocessor (SM) level, and a unified memory manager handles asynchronous memory accesses from both phases. semi-PD also provides a low-overhead mechanism for adjusting the resource split between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm that optimizes SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27–2.58× on DeepSeek-series models, and serving 1.55–1.72× more requests within latency constraints on Llama-series models.
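To make the SLO-aware dynamic partitioning concrete, here is a minimal hedged sketch of one plausible control rule: grow the SM share of whichever phase is violating its latency target, using time-to-first-token (TTFT) as the prefill-bound metric and time-per-output-token (TPOT) as the decode-bound metric. The thresholds, step size, bounds, and tie-handling are illustrative assumptions; the paper's actual algorithm is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class SLOTargets:
    ttft_ms: float  # time-to-first-token target (prefill-bound)
    tpot_ms: float  # time-per-output-token target (decode-bound)

def adjust_partition(prefill_share: int,
                     measured_ttft_ms: float,
                     measured_tpot_ms: float,
                     slo: SLOTargets,
                     step: int = 5) -> int:
    """Return an updated prefill SM share; decode gets the remainder.

    A sketch of an SLO-aware repartitioning rule: shift compute toward
    the phase that is missing its latency target. All constants here
    are illustrative, not the paper's.
    """
    prefill_violated = measured_ttft_ms > slo.ttft_ms
    decode_violated = measured_tpot_ms > slo.tpot_ms
    if prefill_violated and not decode_violated:
        prefill_share += step
    elif decode_violated and not prefill_violated:
        prefill_share -= step
    # If both or neither phase violates its SLO, hold the current split;
    # low-overhead adjustment makes frequent re-evaluation feasible.
    return max(10, min(90, prefill_share))

# Example: decode is missing its TPOT target, so its share grows.
share = adjust_partition(70, measured_ttft_ms=180.0, measured_tpot_ms=65.0,
                         slo=SLOTargets(ttft_ms=200.0, tpot_ms=50.0))
print(share)  # 65 -> prefill 65%, decode 35%
```

Under unified storage, applying the new split requires no KV cache migration, so a loop like this can run at fine time granularity.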