MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

career value

236K/year

🤖 AI Summary

To address redundant data movement, buffer overhead, and high communication latency in MPI caused by explicit memory copying in distributed HPC systems, this paper proposes the first cache-coherent shared-memory MPI communication paradigm based on CXL 3.2. Leveraging hardware-supported cross-host virtual address space direct mapping, it enables message pointer passing instead of data copying, thereby eliminating traditional MPI’s memory copy and serialization overheads. We design and implement an end-to-end co-designed system integrating a CXL controller, an FPGA-based multi-host emulation platform, and a customized software stack to support zero-copy, low-latency MPI runtime execution. Experimental evaluation on representative HPC benchmarks demonstrates up to 47% reduction in communication latency and a 2.1× improvement in bandwidth utilization, significantly enhancing scalability and energy efficiency for large-scale applications.

Technology Category

Application Category

📝 Abstract

MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In response, we introduce MPI-over-CXL, a novel MPI communication paradigm leveraging CXL, which provides cache-coherent shared memory across multiple hosts. MPI-over-CXL replaces traditional data-copy methods with direct shared memory access, significantly reducing communication latency and memory bandwidth usage. By mapping shared memory regions directly into the virtual address spaces of MPI processes, our design enables efficient pointer-based communication, eliminating redundant copying operations. To validate this approach, we implement a comprehensive hardware and software environment, including a custom CXL 3.2 controller, FPGA-based multi-host emulation, and dedicated software stack. Our evaluations using representative benchmarks demonstrate substantial performance improvements over conventional MPI systems, underscoring MPI-over-CXL's potential to enhance efficiency and scalability in large-scale HPC environments.

Problem

Research questions and friction points this paper is trying to address.

Reducing communication latency in distributed HPC systems

Eliminating redundant data copying in MPI implementations

Enhancing memory bandwidth usage through CXL shared memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging CXL for cache-coherent shared memory access

Replacing data-copy with direct shared memory mapping

Implementing custom CXL controller and FPGA emulation

🔎 Similar Papers

Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

2024-06-07International Symposium on High-Performance Computer ArchitectureCitations: 5

Microsoft

$139,900 -

no specific locations mentioned

Research Scientist Intern, MSL Infra Kernels & Optimizations (PhD)