🤖 AI Summary
In existing LLM serving systems, unified architectures suffer from resource contention and scheduling interference between the prefill and decode phases, while disaggregated architectures incur storage inefficiencies: weight redundancy, costly KV cache transfer across GPUs, imbalanced GPU memory utilization, and difficulty with dynamic resource adjustment. This paper proposes semi-PD, a novel serving architecture with disaggregated computation and unified storage. It combines an SM-granularity asynchronous compute controller with a unified GPU memory manager, enabling weight sharing across the two phases and eliminating KV cache transfer between them, and adds an SLO-aware dynamic resource partitioning algorithm. On DeepSeek models, semi-PD reduces average end-to-end latency per request by 1.27–2.58×; on Llama models, it serves 1.55–1.72× more requests within latency constraints.
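The summary describes the compute controller only at a high level. As a rough illustration of SM-level computation partitioning over shared storage, the sketch below launches prefill and decode workers whose SM shares are capped via NVIDIA MPS's documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE setting. The worker scripts, the 70/30 split, and the use of MPS itself are illustrative assumptions, not semi-PD's actual mechanism.

```python
import os
import subprocess

def launch_phase_worker(script: str, sm_percent: int) -> subprocess.Popen:
    """Launch a worker process restricted to a fraction of the GPU's SMs.

    Uses CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, an NVIDIA MPS knob that caps
    the portion of SMs an MPS client may occupy. This is a stand-in for
    semi-PD's own controller, which is not detailed in the summary.
    Assumes an MPS control daemon is already running on the GPU.
    """
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percent)
    return subprocess.Popen(["python", script], env=env)

# Hypothetical worker scripts: both map the same weights and KV cache
# ("unified storage"), but each phase's kernels run on a disjoint share
# of SMs ("disaggregated computation").
prefill = launch_phase_worker("prefill_worker.py", sm_percent=70)
decode = launch_phase_worker("decode_worker.py", sm_percent=30)
```

Because both workers address one copy of the model state, shifting the split only changes compute shares, which is what makes low-overhead repartitioning possible.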
📝 Abstract
Existing large language model (LLM) serving systems fall into two categories: 1) unified systems, where the prefill and decode phases are co-located on the same GPU and share its computational resources and storage, and 2) disaggregated systems, where the two phases are placed on different GPUs. The disaggregated design addresses the latency interference and complicated scheduling of the unified system, but introduces storage challenges: 1) weights replicated for both phases, which prevents flexible deployment; 2) KV cache transfer overhead between the two phases; 3) storage imbalance, which wastes substantial GPU memory capacity; and 4) suboptimal resource adjustment, owing to the difficulty of migrating KV cache. This storage inefficiency yields poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in its disaggregated computation, i.e., partitioning the computational resources so that the two phases compute asynchronously. We therefore propose semi-PD, a novel LLM serving system characterized by disaggregated computation and unified storage. In semi-PD, a computation resource controller achieves disaggregated computation at the streaming multiprocessor (SM) level, and a unified memory manager handles asynchronous memory accesses from both phases. semi-PD also provides a low-overhead mechanism for adjusting the resource split between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm that optimizes SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27–2.58× on DeepSeek-series models, and serving 1.55–1.72× more requests within latency constraints on Llama-series models.
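To make the SLO-aware dynamic partitioning concrete, here is a minimal hedged sketch of one plausible control rule: grow the SM share of whichever phase is violating its latency target, using time-to-first-token (TTFT) as the prefill-bound metric and time-per-output-token (TPOT) as the decode-bound metric. The thresholds, step size, bounds, and tie-handling are illustrative assumptions; the paper's actual algorithm is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class SLOTargets:
    ttft_ms: float  # time-to-first-token target (prefill-bound)
    tpot_ms: float  # time-per-output-token target (decode-bound)

def adjust_partition(prefill_share: int,
                     measured_ttft_ms: float,
                     measured_tpot_ms: float,
                     slo: SLOTargets,
                     step: int = 5) -> int:
    """Return an updated prefill SM share; decode gets the remainder.

    A sketch of an SLO-aware repartitioning rule: shift compute toward
    the phase that is missing its latency target. All constants here
    are illustrative, not the paper's.
    """
    prefill_violated = measured_ttft_ms > slo.ttft_ms
    decode_violated = measured_tpot_ms > slo.tpot_ms
    if prefill_violated and not decode_violated:
        prefill_share += step
    elif decode_violated and not prefill_violated:
        prefill_share -= step
    # If both or neither phase violates its SLO, hold the current split;
    # low-overhead adjustment makes frequent re-evaluation feasible.
    return max(10, min(90, prefill_share))

# Example: decode is missing its TPOT target, so its share grows.
share = adjust_partition(70, measured_ttft_ms=180.0, measured_tpot_ms=65.0,
                         slo=SLOTargets(ttft_ms=200.0, tpot_ms=50.0))
print(share)  # 65 -> prefill 65%, decode 35%
```

Under unified storage, applying the new split requires no KV cache migration, so a loop like this can run at fine time granularity.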