🤖 AI Summary
To address low resource utilization and limited throughput in video-analytics inference serving on heterogeneous GPU clusters, this paper proposes PPipe, a system that brings pooled pipeline parallelism to latency-sensitive inference serving. Leveraging the observation that low-class and high-class GPUs exhibit comparable per-layer inference latency for many layers of multi-layer models, PPipe enables collaborative computation across GPU classes. It combines an MILP-based control plane for pipeline partitioning, a resource reservation mechanism, and an adaptive dynamic batching strategy to schedule work across heterogeneous accelerators. Evaluation on 18 CNN models shows that PPipe raises low-class GPU utilization by 41.1%–65.5% while keeping high-class GPUs highly utilized, yielding 32.2%–75.1% higher end-to-end serving throughput than baseline systems.
📝 Abstract
With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique well-studied for throughput-oriented deep learning model training, can be used effectively for serving latency-bound model inference, e.g., in video analytics systems, on heterogeneous GPU clusters. Our work exploits the synergy between diversity in model layers and diversity in GPU architectures, which results in comparable inference latency for many layers when running on low-class and high-class GPUs. We explore how this overlooked capability of low-class GPUs can be exploited using pipeline parallelism and present a novel inference serving system, PPipe, that employs pool-based pipeline parallelism via an MILP-based control plane and a data plane that performs resource reservation-based adaptive batching. Evaluation results on diverse workloads (18 CNN models) show that PPipe achieves 41.1%–65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, leading to 32.2%–75.1% higher serving throughput compared to various baselines.
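The core trade-off behind pipeline partitioning can be sketched with a toy example. The actual system solves an MILP over GPU pools; the brute-force two-stage split below, with hypothetical per-layer latencies, only illustrates why comparable per-layer latency on low-class GPUs makes offloading early layers attractive.

```python
# Toy sketch of the pipeline-partitioning trade-off (NOT PPipe's MILP):
# assign layers [0:k) to a low-class GPU and [k:n) to a high-class GPU,
# minimizing the pipeline bottleneck (the slower stage's latency).

def best_two_stage_split(lat_low, lat_high):
    """lat_low[i] / lat_high[i]: latency of layer i on each GPU class.
    Returns (k, bottleneck_latency) for the best split point k."""
    n = len(lat_low)
    best = (0, sum(lat_high))  # baseline: everything on the high-class GPU
    for k in range(n + 1):
        stage1 = sum(lat_low[:k])    # early layers on the low-class GPU
        stage2 = sum(lat_high[k:])   # remaining layers on the high-class GPU
        bottleneck = max(stage1, stage2)
        if bottleneck < best[1]:
            best = (k, bottleneck)
    return best

# Hypothetical per-layer latencies (ms): the first three layers run at
# nearly the same speed on both classes, so placing them on the
# low-class GPU frees high-class GPU time at little pipeline cost.
lat_low = [2.0, 2.1, 2.0, 5.0, 6.0]
lat_high = [1.9, 2.0, 1.9, 2.0, 2.5]
k, bottleneck = best_two_stage_split(lat_low, lat_high)
```

With these numbers the best split keeps the first three layers on the low-class GPU, cutting the bottleneck stage latency well below the 10.3 ms of running the whole model on the high-class GPU alone.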