🤖 AI Summary
As HPC architectures grow increasingly complex and irregular scientific algorithms demand efficient asynchronous multithreaded communication, existing MPI abstractions face scalability limitations in Asynchronous Many-Task (AMT) runtime environments. Method: This paper systematically evaluates the applicability and performance bottlenecks of MPI's Virtual Communication Interface (VCI) and Continuation extensions within the HPX AMT runtime, via an MPI-level microbenchmark modeled on HPX's low-level communication mechanism, deep integration of both extensions into HPX, and combined theoretical modeling and empirical evaluation. Contribution/Results: We identify a critical multithreaded message-rate bottleneck: the current continuation proposal caps the achievable multithreaded message rate in the multi-VCI setting, and the recommended one-VCI-per-thread mode is hampered in practice by the attentiveness problem, leaving intra-VCI threading efficiency as the primary scalability limiter. Although the extensions outperform standard MPI, significant optimization opportunities remain in multithreaded throughput and end-to-end latency. This work provides key empirical evidence and concrete guidance for MPI standardization efforts targeting AMT-aware high-performance computing.
📝 Abstract
The increasing complexity of HPC architectures and the growing adoption of irregular scientific algorithms demand efficient support for asynchronous, multithreaded communication. This need is especially pronounced in Asynchronous Many-Task (AMT) systems. Such communication patterns were not a consideration during the design of the original MPI specification, and the MPI community has recently introduced several extensions to address these evolving requirements. This work evaluates two such extensions, the Virtual Communication Interface (VCI) and the Continuation extensions, in the context of HPX, an established AMT runtime. We begin by using an MPI-level microbenchmark, modeled on HPX's low-level communication mechanism, to measure the peak performance potential of these extensions. We then integrate them into HPX to evaluate their effectiveness in real-world scenarios. Our results show that while these extensions can enhance performance compared to standard MPI, areas for improvement remain. The current continuation proposal limits the maximum multithreaded message rate achievable in the multi-VCI setting. Furthermore, the recommended one-VCI-per-thread mode proves ineffective in real-world systems due to the attentiveness problem. These findings underscore the importance of improving intra-VCI threading efficiency to achieve scalable multithreaded communication and fully realize the benefits of recent MPI extensions.