🤖 AI Summary
Asynchronous Many-Task (AMT) systems place distinctive demands on the underlying communication libraries when deployed on extreme-scale heterogeneous architectures: their fine-grained, non-blocking communication patterns are not well covered by the MPI standard. Method: Using HPX as a case study, we formalize the communication abstraction that an AMT system expects from its communication library and identify four categories of communication needs that are not typical in the Bulk Synchronous Parallel (BSP) model. Based on these insights, we design and implement a new parcelport on top of LCI, an experimental communication library. The design starts from the native network layer and combines one-sided communication, queue-based completion notification, explicit progressing, and resource contention mitigation. Contribution/Results: The LCI parcelport outperforms the existing MPI parcelport by up to 50× in microbenchmarks and by 2× in a real-world application. Experiments with LCI parcelport variants quantify the performance contribution of each technique.
📝 Abstract
Asynchronous Many-Task (AMT) systems offer a potential solution for efficiently programming complicated scientific applications on extreme-scale heterogeneous architectures. However, their communication needs differ from those of traditional bulk-synchronous parallel (BSP) applications, posing new challenges for underlying communication libraries. This work systematically studies the communication needs of AMTs and explores how communication libraries can be structured to better satisfy them through a case study of a real-world AMT system, HPX. We first examine its communication stack layout and formalize the communication abstraction that underlying communication libraries need to support. We then analyze its current MPI backend (parcelport) and identify four categories of needs that are not typical in the BSP model and are not well covered by the MPI standard. To bridge these gaps, we design a new parcelport from the native network layer up, built on an experimental communication library, LCI. It incorporates several techniques, including one-sided communication, queue-based completion notification, explicit progressing, and several forms of resource contention mitigation. Overall, the resulting LCI parcelport outperforms the existing MPI parcelport by up to 50x in microbenchmarks and 2x in a real-world application. Using it as a testbed, we design LCI parcelport variants to quantify the performance contribution of each technique. This work combines conceptual analysis and experimental results to offer practical guidelines for the future development of communication libraries and AMT communication layers.
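To make three of the techniques named above more concrete (one-sided posting, queue-based completion notification, and explicit progressing), the following is a minimal in-process sketch. It is a toy model under stated assumptions, not the LCI or HPX API: `ToyNetwork`, `post_put`, `progress`, `poll_completion`, and `drain` are hypothetical names introduced only for illustration, and no real network is involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyNetwork:
    """Hypothetical in-process stand-in for a network endpoint (not LCI's API)."""
    in_flight: deque = field(default_factory=deque)         # posted, not yet delivered
    completion_queue: deque = field(default_factory=deque)  # finished operations

    def post_put(self, tag, payload):
        # One-sided style: the sender posts and returns immediately,
        # with no matching receive call on the target side.
        self.in_flight.append((tag, payload))

    def progress(self, max_ops=1):
        # Explicit progress: nothing advances unless the caller invokes this.
        done = 0
        while self.in_flight and done < max_ops:
            self.completion_queue.append(self.in_flight.popleft())
            done += 1
        return done

    def poll_completion(self):
        # Queue-based notification: pop one finished operation, or None.
        return self.completion_queue.popleft() if self.completion_queue else None


def drain(net):
    """Drive progress and collect completions until nothing is pending."""
    delivered = []
    while net.in_flight or net.completion_queue:
        net.progress()
        entry = net.poll_completion()
        if entry is not None:
            delivered.append(entry)
    return delivered
```

In this sketch, a runtime worker would call `drain` (or interleave `progress` and `poll_completion` with task execution) instead of blocking on individual requests. Completions arrive through a single queue rather than through per-request tests, which is the property the queue-based design is meant to illustrate.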