AI Summary
LLM training on GPU supercomputers such as Frontier suffers from low resource utilization and poor scalability beyond thousands of GPUs because existing collective communication libraries (RCCL, Cray-MPICH) underutilize the hardware. This paper introduces PCCL, the first communication library designed for heterogeneous GPU clusters that achieves full concurrent utilization of both network and compute resources for all-gather and reduce-scatter operations. PCCL employs topology-aware scheduling, fine-grained pipelining, computation-communication overlap, and custom GPU kernels, with deep optimizations across the CUDA, NCCL, and HSA layers. Evaluated on 2048 GPUs of Frontier, PCCL delivers 6-33x higher all-gather throughput than RCCL and 28-70x higher than Cray-MPICH, and achieves up to a 60% speedup for end-to-end GPT-3 (7B) training.
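The summary names fine-grained pipelining and computation-communication overlap as core techniques but does not show how they fit together. The CUDA sketch below illustrates the general pattern only, under stated assumptions: it is not PCCL's implementation, and the `consume_chunk` kernel, the pipeline depth, and the stream/event wiring are hypothetical stand-ins for whatever computation depends on the gathered data. The idea is to split one large all-gather into chunks issued on a communication stream via NCCL, and let a compute stream consume chunk c while chunk c+1 is still in flight.

```cpp
// Minimal sketch of a chunked all-gather pipelined with downstream compute.
// Illustrative only -- NOT PCCL's implementation. Error checking omitted.
// Assumes an already-initialized ncclComm_t; consume_chunk is a hypothetical
// consumer kernel standing in for the computation that needs the gathered data.
#include <cuda_runtime.h>
#include <nccl.h>

__global__ void consume_chunk(const float* chunk, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... use chunk[i] in the dependent computation ...
    }
}

void pipelined_allgather(const float* sendbuf, float* recvbuf,
                         size_t count_per_rank, int nranks,
                         ncclComm_t comm) {
    const int kChunks = 4;                     // pipeline depth (tunable)
    size_t chunk = count_per_rank / kChunks;   // assume divisible for brevity

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);
    cudaEvent_t done[kChunks];

    for (int c = 0; c < kChunks; ++c) {
        cudaEventCreateWithFlags(&done[c], cudaEventDisableTiming);

        // Communicate one chunk: each rank contributes `chunk` elements, so
        // this call produces chunk * nranks elements of output. Note the
        // resulting layout is chunk-major rather than the usual rank-major
        // layout of a single monolithic all-gather.
        ncclAllGather(sendbuf + c * chunk,
                      recvbuf + (size_t)c * chunk * nranks,
                      chunk, ncclFloat, comm, comm_stream);
        cudaEventRecord(done[c], comm_stream);

        // Overlap: compute on chunk c while chunk c+1 is still in flight.
        cudaStreamWaitEvent(compute_stream, done[c], 0);
        size_t n = chunk * nranks;
        consume_chunk<<<(unsigned)((n + 255) / 256), 256, 0, compute_stream>>>(
            recvbuf + (size_t)c * chunk * nranks, n);
    }

    cudaStreamSynchronize(compute_stream);
    for (int c = 0; c < kChunks; ++c) cudaEventDestroy(done[c]);
    cudaStreamDestroy(comm_stream);
    cudaStreamDestroy(compute_stream);
}
```

The chunk-major output layout and the fixed pipeline depth are simplifications; topology awareness, kernel-level optimizations, and scheduling across the network hierarchy described in the paper are outside the scope of this sketch.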
Abstract
We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier -- Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.