🤖 AI Summary
Traditional message-passing interfaces (e.g., MPI) incur substantial communication overhead in HPC systems, limiting scalability. Method: This work explores the performance potential of CXL.mem-enabled cross-node direct memory access (i.e., message-free communication), proposing a fine-grained, per-MPI-call performance modeling framework. It integrates memory-access tracing of MPI buffers via the extended tool Mitos with offline traffic analysis and a lightweight analytical model. Contribution/Results: We present the first systematic evaluation of CXL.mem’s optimization potential for real-world HPC applications—2D heat conduction and HPCG—identifying specific MPI call types and execution scenarios that benefit most from message-free communication. The model achieves prediction errors under 12%, providing empirically validated, targeted guidance for designing CXL.mem-native communication APIs and optimizing MPI runtimes.
📝 Abstract
Heterogeneous memory technologies are increasingly important in addressing the memory wall in HPC systems. While most are deployed in single-node setups, CXL.mem enables memory devices that can be attached to multiple nodes simultaneously, allowing shared memory pooling. This opens new possibilities, particularly for efficient inter-node communication.
In this paper, we present a novel performance evaluation toolchain combined with an extended performance model for message-based communication, which can be used to predict potential performance benefits of using CXL.mem for data exchange. Our approach analyzes the data access patterns of MPI applications: it captures on-node accesses to and from MPI buffers, as well as cross-node MPI traffic, to gain a full understanding of the impact of memory performance. We combine this data in an extended performance model to predict which data transfers could benefit from direct CXL.mem implementations compared to traditional MPI messages. Our model works at per-MPI-call granularity, allowing the identification, and subsequent optimization, of those MPI invocations in the code with the highest potential for speedup through CXL.mem.
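To illustrate the kind of per-call comparison such a model makes, the sketch below contrasts a classic latency/bandwidth (Hockney-style) message cost with a direct shared-memory access cost. All function names, parameters, and constants are hypothetical placeholders for illustration only, not measured values or the actual model from the paper:

```python
# Illustrative per-call cost comparison: traditional MPI message vs.
# direct access through a shared CXL.mem pool. All constants below are
# hypothetical placeholders, not measurements from the paper.

def mpi_message_cost(msg_bytes, latency_s=2e-6, bandwidth_bps=12e9):
    """Latency/bandwidth model for one MPI message (setup cost + transfer)."""
    return latency_s + msg_bytes / bandwidth_bps

def cxl_mem_cost(msg_bytes, access_latency_s=4e-7, bandwidth_bps=30e9):
    """Direct load/store cost over shared CXL.mem (no message setup)."""
    return access_latency_s + msg_bytes / bandwidth_bps

def cxl_speedup(msg_bytes):
    """Predicted speedup from replacing one message with direct access."""
    return mpi_message_cost(msg_bytes) / cxl_mem_cost(msg_bytes)

if __name__ == "__main__":
    for size in (1 << 10, 1 << 16, 1 << 22):  # 1 KiB, 64 KiB, 4 MiB
        print(f"{size:>8} B  speedup x{cxl_speedup(size):.2f}")
```

Even this toy model captures the qualitative trend the paper exploits: small, latency-dominated transfers benefit most from eliminating message setup, which is why a per-call (rather than aggregate) model is needed to pick optimization targets.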
For our toolchain, we extend the memory trace sampling tool Mitos and use it to extract data access behavior. In a post-processing step, the raw data is automatically analyzed to provide performance models for each individual MPI call. We validate the models on two sample applications -- a 2D heat transfer miniapp and the HPCG benchmark -- and use them to demonstrate how targeted optimizations can be supported by integrating CXL.mem.
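For context, a 2D heat transfer miniapp of this kind is typically a 5-point Jacobi stencil whose MPI traffic consists of halo exchanges between neighboring ranks. The serial sketch below shows one such update step; the actual miniapp, its decomposition, and its parameters may differ:

```python
# Minimal serial 5-point Jacobi sweep for 2D heat conduction.
# In the MPI version, each rank owns a sub-grid and exchanges halo
# rows/columns with its neighbors every step -- exactly the repeated,
# fixed-pattern transfers that a shared CXL.mem pool could replace
# with direct loads.

def heat_step(grid, alpha=0.25):
    """One Jacobi sweep over the interior; boundary cells stay fixed."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = grid[i][j] + alpha * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
                - 4 * grid[i][j]
            )
    return new
```

Because every halo exchange in such a stencil moves the same buffers each iteration, it is a natural candidate for the per-call analysis described above.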