The Big Send-off: High Performance Collectives on GPU-based Supercomputers

πŸ“… 2025-04-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
LLM training on GPU supercomputers such as Frontier suffers from low resource utilization and poor scalability beyond a few thousand GPUs because existing collective communication libraries (RCCL, Cray-MPICH) underutilize the hardware. This paper introduces PCCL, presented as the first communication library for heterogeneous GPU clusters that achieves full concurrent utilization of both network and compute resources for all-gather and reduce-scatter operations. PCCL combines topology-aware scheduling, fine-grained pipelining, computation-communication overlap, and custom GPU kernels, with deep optimizations across the CUDA, NCCL, and HSA layers. On 2048 GCDs of Frontier, PCCL delivers 6–33× higher all-gather throughput than RCCL and 28–70× higher than Cray-MPICH, and speeds up end-to-end GPT-3-style training of a 7B-parameter model by up to 60%.
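For readers unfamiliar with the two collectives the paper targets, their semantics can be sketched in plain Python. This is a toy single-process model, not PCCL's API; the function names and the assumption that the vector divides evenly across ranks are illustrative:

```python
def all_gather(shards):
    """Every rank contributes one shard; every rank ends up holding
    the concatenation of all shards (rank order preserved)."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(inputs):
    """Every rank contributes a full-length vector; the element-wise
    sum is computed and split so each rank keeps one reduced shard."""
    p = len(inputs)
    total = [sum(vals) for vals in zip(*inputs)]
    assert len(total) % p == 0  # illustrative assumption: even division
    k = len(total) // p
    return [total[r * k:(r + 1) * k] for r in range(p)]
```

In LLM training these two operations are duals: all-gather reassembles sharded parameters before a layer's forward pass, and reduce-scatter sums gradients while leaving each rank with only its shard, which is why the paper optimizes exactly this pair.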

πŸ“ Abstract
We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier -- Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.
Problem

Research questions and friction points this paper is trying to address.

Optimizing collective communication for GPU-based LLM training
Addressing scalability and resource underutilization in RCCL and Cray-MPICH
Enhancing all-gather and reduce-scatter operations for distributed deep learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

PCCL optimizes all-gather and reduce-scatter operations
Maximizes network and compute resource utilization
Scales efficiently to thousands of GPUs
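The paper does not reproduce PCCL's schedules here, but the classic ring all-gather (a standard bandwidth-optimal algorithm, not necessarily the one PCCL uses) illustrates why such collectives decompose into chunk-granular steps, which is what makes fine-grained pipelining and communication overlap possible. A minimal single-process simulation:

```python
def ring_all_gather(shards):
    """Simulate a ring all-gather over p ranks: in each of p-1 steps,
    every rank forwards the chunk it most recently received to its
    right neighbour. Work is chunk-granular, so each step's send can
    be overlapped with the next step's receive in a real library."""
    p = len(shards)
    # buffers[r][i] is rank r's copy of chunk i (None = not yet received)
    buffers = [[shards[i] if i == r else None for i in range(p)]
               for r in range(p)]
    for step in range(p - 1):
        for r in range(p):
            src = (r - 1) % p          # left neighbour sends to rank r
            idx = (src - step) % p     # chunk src forwards at this step
            buffers[r][idx] = buffers[src][idx]
    return buffers                     # every row now equals shards
```

Each rank sends and receives (p-1)/p of the total data, independent of p, which is the bandwidth-optimality property that scalable collective libraries build on.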
πŸ”Ž Similar Papers
2024-06-07International Symposium on High-Performance Computer ArchitectureCitations: 5