🤖 AI Summary
To address redundant data movement, buffer overhead, and high communication latency in MPI caused by explicit memory copying in distributed HPC systems, this paper proposes the first cache-coherent shared-memory MPI communication paradigm based on CXL 3.2. Leveraging hardware-supported cross-host virtual address space direct mapping, it enables message pointer passing instead of data copying, thereby eliminating traditional MPI’s memory copy and serialization overheads. We design and implement an end-to-end co-designed system integrating a CXL controller, an FPGA-based multi-host emulation platform, and a customized software stack to support zero-copy, low-latency MPI runtime execution. Experimental evaluation on representative HPC benchmarks demonstrates up to 47% reduction in communication latency and a 2.1× improvement in bandwidth utilization, significantly enhancing scalability and energy efficiency for large-scale applications.
📝 Abstract
MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In response, we introduce MPI-over-CXL, a novel MPI communication paradigm leveraging CXL, which provides cache-coherent shared memory across multiple hosts. MPI-over-CXL replaces traditional data-copy methods with direct shared memory access, significantly reducing communication latency and memory bandwidth usage. By mapping shared memory regions directly into the virtual address spaces of MPI processes, our design enables efficient pointer-based communication, eliminating redundant copying operations. To validate this approach, we implement a comprehensive hardware and software environment, including a custom CXL 3.2 controller, FPGA-based multi-host emulation, and dedicated software stack. Our evaluations using representative benchmarks demonstrate substantial performance improvements over conventional MPI systems, underscoring MPI-over-CXL's potential to enhance efficiency and scalability in large-scale HPC environments.