🤖 AI Summary
This work addresses the challenges of plasma particle-in-cell simulations on heterogeneous supercomputing systems, including frequent data movement, high synchronization overhead, and underutilized multi-GPU resources. The authors propose a hybrid MPI+OpenMP parallelization strategy that leverages OpenMP tasking with explicit dependency management to overlap computation and communication on both NVIDIA and AMD GPUs. By integrating persistent device memory, a one-dimensional contiguous data layout, pinned host memory, and GPU-direct DMA transfers, the approach significantly improves data-transfer efficiency and device memory access. Furthermore, seamless integration with openPMD and ADIOS2 enables high-performance I/O. Evaluated on pre-exascale and exascale systems, including Frontier, the implementation scales to 16,000 GPUs, substantially reducing runtime while markedly improving portability, scalability, and hardware utilization.
📝 Abstract
Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face increasing challenges on heterogeneous HPC systems due to excessive data movement, synchronization overheads, and inefficient utilization of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that enables scalable execution on both NVIDIA and AMD accelerators through OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory to improve large data-transfer efficiency, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized and scalable I/O is provided using openPMD and ADIOS2, supporting high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) on up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.