🤖 AI Summary
This work addresses the performance limitations of current GPU communication APIs, which either rely on CPU involvement or impose substantial synchronization overhead, thereby constraining the efficiency of machine learning and high-performance computing applications. The authors propose and implement a novel MPI-based GPU communication abstraction that, for the first time, enables fully CPU-bypassed GPU-to-GPU communication within the MPI framework and natively supports halo exchange primitives such as gather and scatter. By integrating MPI extensions, HPE Slingshot 11 network hardware, and the Cabana/Kokkos portable programming model, the design achieves up to a 50% reduction in medium-message latency and a 28% improvement in halo exchange performance when strong scaling to 8,192 GPUs on the Frontier supercomputer.
📝 Abstract
Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper we describe the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how it naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs of the Frontier supercomputer.