🤖 AI Summary
To address low Infinity Fabric interconnect bandwidth utilization and high cross-CPU/GPU data movement overhead in AMD MI300A multi-APU systems, this work proposes a communication optimization framework for collaborative multi-APU computing. The authors design an explicit data transfer mechanism and a fine-grained DMA scheduling strategy; systematically characterize and compare the communication behaviors of HIP, MPI, and RCCL over Infinity Fabric; and introduce a unified-memory-aware programming interface alongside a hierarchical memory allocation scheme. Empirical evaluation with custom microbenchmarks and real-world HPC workloads (Quicksilver and CloverLeaf) demonstrates up to 2.3× higher communication throughput and a 1.8× end-to-end application speedup on a four-APU system, significantly improving data movement efficiency on heterogeneous multi-APU architectures.
📝 Abstract
The ever-increasing compute performance of GPU accelerators drives up the need for efficient data movement within HPC applications to sustain performance. Proposed as a solution to alleviate CPU-GPU data movement overhead, the AMD MI300A Accelerated Processing Unit (APU) combines CPU, GPU, and high-bandwidth memory (HBM) within a single physical package. Leadership supercomputers, such as El Capitan, group four APUs within a single compute node, connected via the Infinity Fabric interconnect. In this work, we design targeted benchmarks to evaluate direct memory access from the GPU, explicit inter-APU data movement, and collective multi-APU communication. We also compare the efficiency of HIP APIs, MPI routines, and the GPU-specialized RCCL library. Our results highlight key design choices for optimizing inter-APU communication on multi-APU AMD MI300A systems with Infinity Fabric, including programming interfaces, allocators, and data movement strategies. Finally, we optimize two real HPC applications, Quicksilver and CloverLeaf, and evaluate them on a four MI300A APU system.