🤖 AI Summary
To address load imbalance, intra- and inter-stage execution bubbles, and KV cache bottlenecks in multi-GPU inference of large language models (LLMs) under pipeline parallelism (PP), this paper proposes SiPipe, a heterogeneous pipeline architecture. Its core innovations are: (1) offloading sampling, token-safe execution, and structure-aware communication to the CPU to mitigate bubbles and improve pipeline saturation; and (2) co-optimizing KV cache management with asynchronous CPU–GPU task scheduling. Experiments across multiple mainstream LLMs show that SiPipe achieves up to 2.1× higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization than baseline PP implementations, significantly improving inference efficiency and hardware resource utilization in multi-GPU LLM deployment.
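To make the CPU-offloading idea concrete, here is a minimal sketch of the general pattern of overlapping GPU forward passes with CPU-side sampling via a producer–consumer queue. All names (`gpu_forward`, `cpu_sample`, `run_pipeline`) are illustrative stand-ins, not SiPipe's actual API; real systems would move logits off-device asynchronously and use a proper sampler.

```python
import queue
import threading

def gpu_forward(step):
    # Stand-in for a GPU forward pass that produces logits
    # for one decoding step (hypothetical toy values).
    return [step * 10 + i for i in range(4)]

def cpu_sample(logits):
    # Stand-in for CPU-side sampling; greedy argmax for simplicity.
    return max(range(len(logits)), key=logits.__getitem__)

def run_pipeline(num_steps):
    # The "GPU" thread keeps producing logits while a CPU worker
    # samples tokens from earlier steps, so sampling does not
    # stall the pipeline (the bubble SiPipe's CPU sampling targets).
    logits_q = queue.Queue()
    tokens = []

    def sampler():
        while True:
            logits = logits_q.get()
            if logits is None:  # sentinel: no more work
                break
            tokens.append(cpu_sample(logits))

    worker = threading.Thread(target=sampler)
    worker.start()
    for step in range(num_steps):
        logits_q.put(gpu_forward(step))  # enqueue and move on immediately
    logits_q.put(None)
    worker.join()
    return tokens

print(run_pipeline(3))  # greedy argmax picks the last index each step: [3, 3, 3]
```

The point of the sketch is the decoupling: the producer never waits on sampling, which is the saturation benefit the summary attributes to offloading auxiliary work to the CPU.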
📝 Abstract
As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load imbalance, intra-stage, and inter-stage) that limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1× higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization compared to the state-of-the-art vLLM under the same PP configuration, demonstrating its generality across LLMs and deployment scenarios.