🤖 AI Summary
To address load imbalance, intra- and inter-stage execution bubbles, and KV cache bottlenecks in multi-GPU inference of large language models (LLMs) under pipeline parallelism (PP), this paper proposes SiPipe, a heterogeneous pipeline architecture. Its core innovations are: (1) offloading sampling, token-safe execution, and structure-aware communication to the CPU to mitigate bubbles and improve pipeline saturation; and (2) co-optimizing KV cache management with asynchronous CPU–GPU task scheduling. Experiments across multiple mainstream LLMs show that SiPipe achieves up to 2.1× higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization than baseline PP implementations, significantly improving inference efficiency and hardware resource utilization in multi-GPU LLM deployment.
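To make the CPU-offloading idea concrete, here is a minimal sketch of the general pattern of overlapping GPU forward passes with CPU-side sampling via a producer–consumer queue. All names (`gpu_forward`, `cpu_sample`, `run_pipeline`) are illustrative stand-ins, not SiPipe's actual API; real systems would move logits off-device asynchronously and use a proper sampler.

```python
import queue
import threading

def gpu_forward(step):
    # Stand-in for a GPU forward pass that produces logits
    # for one decoding step (hypothetical toy values).
    return [step * 10 + i for i in range(4)]

def cpu_sample(logits):
    # Stand-in for CPU-side sampling; greedy argmax for simplicity.
    return max(range(len(logits)), key=logits.__getitem__)

def run_pipeline(num_steps):
    # The "GPU" thread keeps producing logits while a CPU worker
    # samples tokens from earlier steps, so sampling does not
    # stall the pipeline (the bubble SiPipe's CPU sampling targets).
    logits_q = queue.Queue()
    tokens = []

    def sampler():
        while True:
            logits = logits_q.get()
            if logits is None:  # sentinel: no more work
                break
            tokens.append(cpu_sample(logits))

    worker = threading.Thread(target=sampler)
    worker.start()
    for step in range(num_steps):
        logits_q.put(gpu_forward(step))  # enqueue and move on immediately
    logits_q.put(None)
    worker.join()
    return tokens

print(run_pipeline(3))  # greedy argmax picks the last index each step: [3, 3, 3]
```

The point of the sketch is the decoupling: the producer never waits on sampling, which is the saturation benefit the summary attributes to offloading auxiliary work to the CPU.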
📝 Abstract
As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load imbalance, intra-stage, and inter-stage) that limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1× higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization compared to the state-of-the-art vLLM under the same PP configuration, demonstrating its generality across LLMs and deployment scenarios.