🤖 AI Summary
This study addresses the data transfer bottleneck in high-performance computing (HPC) environments for coupled AI-simulation workflows. It systematically investigates optimal data transfer strategies across two canonical scenarios: one-to-one (co-located simulation and AI training) and many-to-one (multiple simulations feeding a single AI model). Leveraging the SimAI-Bench benchmark framework, the study conducts the first end-to-end performance evaluation on the Aurora supercomputer, comparing node-local storage, DragonHPC, Redis, and Lustre. Results show that in the one-to-one setting, node-local storage and DragonHPC achieve the lowest latency and highest throughput; in the many-to-one setting, parallel file systems, particularly Lustre, significantly outperform the alternatives due to superior aggregate bandwidth and scalable metadata handling. The key contribution is the empirical demonstration that workflow topology fundamentally dictates the optimal transport strategy, providing reproducible performance models and evidence-based guidance for strategy selection in heterogeneous coupled computing systems.
📝 Abstract
Coupled AI-Simulation workflows are becoming the major workloads for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and for prototyping new in-situ workflows. We present SimAI-Bench, a tool designed to both prototype and evaluate these coupled workflows. In this paper, we use SimAI-Bench to benchmark the data transport performance of two common patterns on the Aurora supercomputer: a one-to-one workflow with co-located simulation and AI training instances, and a many-to-one workflow where a single AI model is trained from an ensemble of simulations. For the one-to-one pattern, our analysis shows that the node-local and DragonHPC data staging strategies provide excellent performance compared to Redis and the Lustre file system. For the many-to-one pattern, we find that data transport becomes a dominant bottleneck as the ensemble size grows. Our evaluation reveals that the file system is the optimal solution among the tested strategies for the many-to-one pattern.