🤖 AI Summary
Particle-in-Cell Monte Carlo (PIC-MC) plasma simulation is a computational bottleneck in nuclear fusion reactor design: the method is compute-intensive and strongly coupled across scales. We propose the first cross-architecture parallel framework that combines MPI for inter-node communication, OpenMP/OpenACC for heterogeneous thread-level parallelism, and asynchronous, pipelined scheduling across multiple GPUs. The framework overlaps computation, communication, and data transfer by using CUDA asynchronous streams, unified memory management, and an adaptive load-balancing algorithm. Evaluated on kilo-particle-scale plasma simulations, it achieves up to 12.8× speedup on a 128-GPU cluster, with a strong scaling efficiency of 92%. This shortens simulation turnaround time and provides a scalable high-performance computing foundation for high-fidelity fusion plasma modeling.
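For readers unfamiliar with the two reported metrics, the sketch below shows the conventional definitions of speedup and strong scaling efficiency for a fixed problem size. It is only an illustration: the function names and the timings are hypothetical, and the summary does not state which baseline configuration the reported figures were measured against.

```python
def speedup(t_base: float, t_n: float) -> float:
    """Ratio of baseline wall-clock time to the time on the larger run."""
    return t_base / t_n


def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_n: float, n: int) -> float:
    """Measured speedup divided by the ideal speedup n / n_base.

    Strong scaling keeps the total problem size fixed while the
    device count grows from n_base to n.
    """
    ideal = n / n_base
    return speedup(t_base, t_n) / ideal


# Hypothetical timings (seconds), NOT taken from the paper:
# the same problem on 4 devices and on 32 devices.
t_4, t_32 = 960.0, 150.0

s = speedup(t_4, t_32)
e = strong_scaling_efficiency(t_4, 4, t_32, 32)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

Under this definition, an efficiency close to 100% means that adding devices shortens the run almost in proportion to the device count.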