MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems

📅 2025-10-16
🤖 AI Summary
To address the redundant data movement, buffer-management overhead, and high communication latency that explicit memory copying imposes on MPI in distributed HPC systems, this paper proposes the first cache-coherent shared-memory MPI communication paradigm based on CXL 3.2. Leveraging hardware-supported direct mapping of a cross-host virtual address space, it passes message pointers instead of copying data, eliminating traditional MPI's memory-copy and serialization overheads. The authors design and implement an end-to-end co-designed system, integrating a CXL controller, an FPGA-based multi-host emulation platform, and a customized software stack, to support a zero-copy, low-latency MPI runtime. Experiments on representative HPC benchmarks demonstrate up to a 47% reduction in communication latency and a 2.1× improvement in bandwidth utilization, significantly enhancing scalability and energy efficiency for large-scale applications.

📝 Abstract
MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In response, we introduce MPI-over-CXL, a novel MPI communication paradigm leveraging CXL, which provides cache-coherent shared memory across multiple hosts. MPI-over-CXL replaces traditional data-copy methods with direct shared memory access, significantly reducing communication latency and memory bandwidth usage. By mapping shared memory regions directly into the virtual address spaces of MPI processes, our design enables efficient pointer-based communication, eliminating redundant copying operations. To validate this approach, we implement a comprehensive hardware and software environment, including a custom CXL 3.2 controller, FPGA-based multi-host emulation, and dedicated software stack. Our evaluations using representative benchmarks demonstrate substantial performance improvements over conventional MPI systems, underscoring MPI-over-CXL's potential to enhance efficiency and scalability in large-scale HPC environments.
Problem

Research questions and friction points this paper is trying to address.

Reducing communication latency in distributed HPC systems
Eliminating redundant data copying in MPI implementations
Enhancing memory bandwidth usage through CXL shared memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging CXL for cache-coherent shared memory access
Replacing data-copy with direct shared memory mapping
Implementing custom CXL controller and FPGA emulation
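The core idea, replacing per-message data copies with a descriptor that points into a coherently shared region, can be illustrated with a small sketch. This is a conceptual analogy using POSIX shared memory within one machine, not the paper's actual CXL 3.2 runtime; the names (`send_with_copy`, the `(name, offset, length)` descriptor) are illustrative assumptions.

```python
# Conceptual sketch of copy-based vs. pointer-based message passing.
# NOT the MPI-over-CXL implementation: we emulate a cross-host coherent
# region with a local POSIX shared-memory segment.
from multiprocessing import shared_memory

PAYLOAD = b"x" * (1 << 20)  # 1 MiB message

def send_with_copy(payload: bytes) -> bytes:
    """Copy-based path: sender stages a copy, receiver copies it out again."""
    staging = bytearray(payload)   # sender-side copy into a comm buffer
    return bytes(staging)          # receiver-side copy out of the buffer

# Pointer-based path: the payload is written once into a shared segment;
# only a small (name, offset, length) descriptor is exchanged. With
# hardware coherence across hosts (as CXL provides), the receiver can
# dereference the same bytes directly.
shm = shared_memory.SharedMemory(create=True, size=1 << 21)
try:
    offset, length = 0, len(PAYLOAD)
    shm.buf[offset:offset + length] = PAYLOAD   # producer writes in place
    descriptor = (shm.name, offset, length)     # all that is "sent"

    # Receiver side: attach by name and read in place.
    peer = shared_memory.SharedMemory(name=descriptor[0])
    view = peer.buf[descriptor[1]:descriptor[1] + descriptor[2]]
    assert bytes(view) == send_with_copy(PAYLOAD)  # same bytes delivered
    del view            # release the memoryview before closing the segment
    peer.close()
finally:
    shm.close()
    shm.unlink()
```

In the copy-based path the payload traverses memory at least twice; in the pointer-based path only the fixed-size descriptor crosses between processes, which is what lets the paper eliminate serialization and buffer-copy overhead.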
Authors

Miryeong Kwon, Panmnesia, Inc.
Donghyun Gouk, Panmnesia, Inc.
Hyein Woo, Panmnesia, Inc.
Junhee Kim, Panmnesia, Inc.
Jinwoo Baek, Panmnesia, Inc.
Kyungkuk Nam, Panmnesia, Inc.
Sangyoon Ji, Panmnesia, Inc.
Jiseon Kim, KAIST (Natural Language Processing, Computational Social Science)
Hanyeoreum Bae, Panmnesia, Inc.
Junhyeok Jang, Panmnesia, Inc.
Hyunwoo You, Panmnesia, Inc.
Junseok Moon, Panmnesia, Inc.
Myoungsoo Jung, KAIST Endowed Chair Professor, Department of Electrical Engineering, KAIST (Computer Architecture, Solid State Drive, Non-Volatile Memory, CXL, Operating Systems)