In-Transit Data Transport Strategies for Coupled AI-Simulation Workflow Patterns

📅 2025-09-23
🤖 AI Summary
This study addresses the data transfer bottleneck in coupled AI-simulation workflows running in high-performance computing (HPC) environments. It systematically investigates data transfer strategies across two canonical patterns: one-to-one (co-located simulation and AI training) and many-to-one (multiple simulations feeding a single AI model). Using the SimAI-Bench benchmark framework, the authors conduct the first end-to-end performance evaluation on the Aurora supercomputer, comparing node-local storage, DragonHPC, Redis, and Lustre. In the one-to-one setting, node-local storage and DragonHPC achieve the lowest latency and highest throughput. In the many-to-one setting, the Lustre parallel file system significantly outperforms the alternatives, which the summary attributes to superior aggregate bandwidth and scalable metadata handling. The key contribution is the empirical demonstration that workflow topology dictates which transport strategy is optimal, together with reproducible performance models and evidence-based guidance for strategy selection in heterogeneous coupled computing systems.

📝 Abstract
Coupled AI-Simulation workflows are becoming major workloads for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and for prototyping new in-situ workflows. We present SimAI-Bench, a tool designed to both prototype and evaluate these coupled workflows. In this paper, we use SimAI-Bench to benchmark the data transport performance of two common patterns on the Aurora supercomputer: a one-to-one workflow with co-located simulation and AI training instances, and a many-to-one workflow where a single AI model is trained from an ensemble of simulations. For the one-to-one pattern, our analysis shows that node-local and DragonHPC data staging strategies provide excellent performance compared to Redis and the Lustre file system. For the many-to-one pattern, we find that data transport becomes a dominant bottleneck as the ensemble size grows. Our evaluation reveals that the file system is the optimal solution among the tested strategies for the many-to-one pattern.
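
The many-to-one pattern described above can be illustrated with a minimal sketch of file-based staging, assuming a shared directory stands in for the file system and threads stand in for ensemble members. This is not the SimAI-Bench API; the names `stage_write`, `stage_read_all`, `run_simulation`, and `run_many_to_one` are hypothetical and exist only for illustration.

```python
import os
import pickle
import tempfile
import threading


def stage_write(stage_dir, key, payload):
    """Producer side: write payload under `key`, then publish it atomically."""
    tmp_path = os.path.join(stage_dir, f".{key}.tmp")
    final_path = os.path.join(stage_dir, f"{key}.pkl")
    with open(tmp_path, "wb") as f:
        pickle.dump(payload, f)
    # os.replace is atomic when both paths live on the same file system,
    # so the consumer never observes a half-written snapshot.
    os.replace(tmp_path, final_path)


def stage_read_all(stage_dir):
    """Consumer side: load every published snapshot, keyed by name."""
    snapshots = {}
    for name in sorted(os.listdir(stage_dir)):
        if name.endswith(".pkl"):  # skips in-progress ".tmp" files
            with open(os.path.join(stage_dir, name), "rb") as f:
                snapshots[name[:-4]] = pickle.load(f)
    return snapshots


def run_simulation(stage_dir, sim_id, n_steps):
    """Stand-in for one ensemble member: emits one small snapshot per step."""
    for step in range(n_steps):
        data = [float(sim_id + step)] * 4  # placeholder for a simulation field
        stage_write(stage_dir, f"sim{sim_id}_step{step}", data)


def run_many_to_one(n_sims=4, n_steps=3):
    """Many producers write to one staging area; one consumer reads it all."""
    with tempfile.TemporaryDirectory() as stage_dir:
        producers = [
            threading.Thread(target=run_simulation, args=(stage_dir, i, n_steps))
            for i in range(n_sims)
        ]
        for t in producers:
            t.start()
        for t in producers:
            t.join()
        # The single "AI trainer" would consume these snapshots as training data.
        return stage_read_all(stage_dir)
```

In this sketch the consumer's work grows with the number of producers, which loosely mirrors the abstract's observation that data transport becomes a dominant cost as the ensemble size grows. A one-to-one variant would use a single producer writing to node-local storage instead of a shared directory.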
Problem

Research questions and friction points this paper is trying to address.

Analyzing data transport strategies for coupled AI-simulation workflows
Benchmarking performance of one-to-one and many-to-one workflow patterns
Identifying optimal data transport solutions for HPC environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

SimAI-Bench tool for prototyping and evaluating workflows
Node-local and DragonHPC staging for one-to-one pattern
File system as optimal solution for many-to-one pattern
Harikrishna Tummalapalli
Argonne Leadership Computing Facility
Workflows, High performance computing, Turbulent combustion
Riccardo Balin
Argonne National Laboratory
CFD, turbulence modeling, machine learning
Christine M. Simpson
Argonne National Laboratory, Lemont, IL, USA
Andrew Park
Rutgers University, New Brunswick, NJ, USA
Aymen Alsaadi
Ph.D, Rutgers University
cloud computing, HPC, parallel processing, workflow management
Andrew E. Shao
Hewlett Packard Enterprise, Victoria, BC, Canada
Wesley Brewer
Oak Ridge National Laboratory, Oak Ridge, TN, USA
S. Jha
Rutgers University, New Brunswick, NJ, USA; Princeton Plasma Physics Laboratory, Princeton, NJ, USA