🤖 AI Summary
Modern AI models are scaling rapidly while interconnect bandwidth growth lags, making multi-GPU communication a critical performance bottleneck. Existing overlap-optimization techniques struggle to approach theoretical peak throughput under heterogeneous workloads and on emerging accelerators. This paper introduces the first systematic, general-purpose multi-GPU kernel design paradigm: it defines eight fundamental communication primitives and establishes a unified programming template, reducing complex kernel development to reusable, principle-based abstractions. Built as a CUDA extension of the ThunderKittens framework, ParallelKittens (PK) integrates transmission-mechanism modeling, resource-aware scheduling, and overhead control. Evaluated on Hopper and Blackwell architectures with fewer than 50 lines of device code, PK achieves up to 2.33× speedup on data- and tensor-parallel workloads, 4.08× on sequence-parallel workloads, and 1.22× on expert-parallel workloads over state-of-the-art baselines, improving cross-architecture and cross-workload communication efficiency as well as developer productivity.
📝 Abstract
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.