🤖 AI Summary
This study addresses the lack of systematic comparison between Asynchronous Many-Task (AMT) runtime systems and MPI in terms of both performance and programming productivity. Leveraging the Task Bench framework, it presents the first unified benchmark incorporating Itoyori and ItoyoriFBC alongside HPX and MPI, evaluating them across diverse workloads that exercise PGAS abstractions, RDMA-based work stealing, and future-based synchronization. Quantitative analysis via application efficiency, METG, lines of code, and library constructs reveals distinct trade-offs: Itoyori achieves the highest efficiency with the most concise code; MPI excels on regular, communication-light workloads yet requires verbose implementations; HPX demonstrates robust stability under load imbalance but the lowest productivity; and ItoyoriFBC offers enhanced expressiveness at a modest performance cost.
📝 Abstract
Asynchronous Many-Task (AMT) runtimes offer a productive alternative to the Message Passing Interface (MPI). However, the diverse AMT landscape makes fair comparisons challenging. Task Bench, proposed by Slaughter et al., addresses this challenge through a parameterized framework for evaluating parallel programming systems. This work integrates two recent cluster AMTs, Itoyori and ItoyoriFBC, into Task Bench for comprehensive evaluation against MPI and HPX. Itoyori employs a Partitioned Global Address Space (PGAS) model with RDMA-based work stealing, while ItoyoriFBC extends it with future-based synchronization. We evaluate these systems in terms of both performance and programmer productivity. Performance is assessed across various configurations, including compute-bound kernels, weak scaling, and both imbalanced and communication-intensive patterns. Performance is quantified using application efficiency, i.e., the percentage of maximum performance achieved, and the Minimum Effective Task Granularity (METG), i.e., the smallest task duration before runtime overheads dominate. Programmer productivity is quantified using Lines of Code (LOC) and the Number of Library Constructs (NLC). Our results reveal distinct trade-offs. MPI achieves the highest efficiency for regular, communication-light workloads but requires verbose, low-level code. HPX maintains stable efficiency under load imbalance across varying node counts, yet ranks last in productivity metrics, demonstrating that AMTs do not inherently guarantee improved productivity over MPI. Itoyori achieves the highest efficiency in communication-intensive configurations while leading in programmer productivity. ItoyoriFBC exhibits slightly lower efficiency than Itoyori, though its future-based synchronization offers potential for expressing irregular workloads.
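The two performance metrics above can be illustrated with a minimal sketch. This is not code from the paper: the function names, the 50% threshold (the conventional METG(50%) cutoff in Task Bench), and the sample data points are all illustrative assumptions.

```python
# Illustrative sketch (hypothetical names and data, not from the paper):
# computing application efficiency and METG from measured runs.

def application_efficiency(achieved_rate, best_rate):
    """Percentage of the best observed performance that a run achieves."""
    return 100.0 * achieved_rate / best_rate

def metg(points, threshold=50.0):
    """Smallest average task granularity (seconds) whose efficiency still
    meets the threshold; points = [(granularity_s, efficiency_pct), ...].
    Returns None if no measured granularity meets the threshold."""
    meeting = [g for g, eff in points if eff >= threshold]
    return min(meeting) if meeting else None

# Hypothetical measurements: efficiency degrades as tasks shrink,
# because fixed runtime overhead dominates shorter tasks.
points = [(1e-2, 99.0), (1e-3, 95.0), (1e-4, 70.0), (1e-5, 30.0)]
print(application_efficiency(70.0, 100.0))  # -> 70.0
print(metg(points))                         # -> 0.0001 (i.e., 100 us)
```

A lower METG means the runtime can profitably schedule finer-grained tasks before its overheads dominate, which is why the paper uses it alongside application efficiency rather than raw throughput alone.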