AI Summary
LLM training on GPU supercomputers such as Frontier suffers from low resource utilization and poor scalability beyond thousands of GPUs because existing collective communication libraries (RCCL, Cray-MPICH) underutilize the hardware. This paper introduces PCCL, the first communication library designed for heterogeneous GPU clusters that achieves full concurrent utilization of both network and compute resources for all-gather and reduce-scatter operations. PCCL employs topology-aware scheduling, fine-grained pipelining, computation-communication overlap, and custom GPU kernels, with deep optimizations across the CUDA, NCCL, and HSA layers. Evaluated on 2048 GPUs of Frontier, PCCL delivers 6-33x higher all-gather throughput than RCCL and 28-70x higher than Cray-MPICH, and achieves up to a 60% speedup for end-to-end GPT-3 (7B) training.
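The summary names fine-grained pipelining and computation-communication overlap as core techniques but does not show how they fit together. The CUDA sketch below illustrates the general pattern only, under stated assumptions: it is not PCCL's implementation, and the `consume_chunk` kernel, the pipeline depth, and the stream/event wiring are hypothetical stand-ins for whatever computation depends on the gathered data. The idea is to split one large all-gather into chunks issued on a communication stream via NCCL, and let a compute stream consume chunk c while chunk c+1 is still in flight.

```cpp
// Minimal sketch of a chunked all-gather pipelined with downstream compute.
// Illustrative only -- NOT PCCL's implementation. Error checking omitted.
// Assumes an already-initialized ncclComm_t; consume_chunk is a hypothetical
// consumer kernel standing in for the computation that needs the gathered data.
#include <cuda_runtime.h>
#include <nccl.h>

__global__ void consume_chunk(const float* chunk, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... use chunk[i] in the dependent computation ...
    }
}

void pipelined_allgather(const float* sendbuf, float* recvbuf,
                         size_t count_per_rank, int nranks,
                         ncclComm_t comm) {
    const int kChunks = 4;                     // pipeline depth (tunable)
    size_t chunk = count_per_rank / kChunks;   // assume divisible for brevity

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);
    cudaEvent_t done[kChunks];

    for (int c = 0; c < kChunks; ++c) {
        cudaEventCreateWithFlags(&done[c], cudaEventDisableTiming);

        // Communicate one chunk: each rank contributes `chunk` elements, so
        // this call produces chunk * nranks elements of output. Note the
        // resulting layout is chunk-major rather than the usual rank-major
        // layout of a single monolithic all-gather.
        ncclAllGather(sendbuf + c * chunk,
                      recvbuf + (size_t)c * chunk * nranks,
                      chunk, ncclFloat, comm, comm_stream);
        cudaEventRecord(done[c], comm_stream);

        // Overlap: compute on chunk c while chunk c+1 is still in flight.
        cudaStreamWaitEvent(compute_stream, done[c], 0);
        size_t n = chunk * nranks;
        consume_chunk<<<(unsigned)((n + 255) / 256), 256, 0, compute_stream>>>(
            recvbuf + (size_t)c * chunk * nranks, n);
    }

    cudaStreamSynchronize(compute_stream);
    for (int c = 0; c < kChunks; ++c) cudaEventDestroy(done[c]);
    cudaStreamDestroy(comm_stream);
    cudaStreamDestroy(compute_stream);
}
```

The chunk-major output layout and the fixed pipeline depth are simplifications; topology awareness, kernel-level optimizations, and scheduling across the network hierarchy described in the paper are outside the scope of this sketch.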
Abstract
We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier -- Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.