🤖 AI Summary
To address low Infinity Fabric interconnect bandwidth utilization and high cross-CPU/GPU data movement overhead in AMD MI300A multi-APU systems, this work proposes a communication optimization framework for collaborative multi-APU computing. The authors design an explicit data transfer mechanism and a fine-grained DMA scheduling strategy; systematically characterize and compare the communication behaviors of HIP, MPI, and RCCL over Infinity Fabric; and introduce a unified-memory-aware programming interface alongside a hierarchical memory allocation scheme. Empirical evaluation with custom microbenchmarks and real-world HPC workloads (Quicksilver and CloverLeaf) demonstrates up to 2.3× higher communication throughput and a 1.8× end-to-end application speedup on a four-APU system, significantly improving data movement efficiency on heterogeneous multi-APU architectures.
📝 Abstract
The ever-increasing compute performance of GPU accelerators drives up the need for efficient data movement within HPC applications to sustain performance. Proposed as a solution to alleviate CPU-GPU data movement overhead, the AMD MI300A Accelerated Processing Unit (APU) combines CPU, GPU, and high-bandwidth memory (HBM) within a single physical package. Leadership supercomputers, such as El Capitan, group four APUs within a single compute node, connected via the Infinity Fabric interconnect. In this work, we design targeted benchmarks to evaluate direct memory access from the GPU, explicit inter-APU data movement, and collective multi-APU communication. We also compare the efficiency of HIP APIs, MPI routines, and the GPU-specialized RCCL library. Our results highlight key design choices for optimizing inter-APU communication on multi-APU AMD MI300A systems with Infinity Fabric, including programming interfaces, allocators, and data movement strategies. Finally, we optimize two real HPC applications, Quicksilver and CloverLeaf, and evaluate them on a four MI300A APU system.