In-Transit Data Transport Strategies for Coupled AI-Simulation Workflow Patterns

📅 2025-09-23

🤖 AI Summary

This study addresses the data transfer bottleneck in coupled AI-simulation workflows on high-performance computing (HPC) systems. It investigates data transport strategies across two canonical patterns: one-to-one (a simulation co-located with an AI training instance) and many-to-one (an ensemble of simulations feeding a single AI model). Using the SimAI-Bench benchmark framework, the authors evaluate data transport performance on the Aurora supercomputer, comparing node-local storage, DragonHPC, Redis, and the Lustre file system. In the one-to-one setting, node-local storage and DragonHPC staging perform best. In the many-to-one setting, data transport becomes a dominant bottleneck as the ensemble grows, and the parallel file system (Lustre) outperforms the other tested strategies. The key contribution is the empirical demonstration that workflow topology determines which transport strategy is optimal, which gives evidence-based guidance for strategy selection in coupled workflows.

📝 Abstract

Coupled AI-Simulation workflows are becoming a major workload for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and prototyping of new in-situ workflows. We present SimAI-Bench, a tool designed to both prototype and evaluate these coupled workflows. In this paper, we use SimAI-Bench to benchmark the data transport performance of two common patterns on the Aurora supercomputer: a one-to-one workflow with co-located simulation and AI training instances, and a many-to-one workflow where a single AI model is trained from an ensemble of simulations. For the one-to-one pattern, our analysis shows that node-local and DragonHPC data staging strategies provide excellent performance compared to Redis and the Lustre file system. For the many-to-one pattern, we find that data transport becomes a dominant bottleneck as the ensemble size grows. Our evaluation reveals that the file system is the optimal solution among the tested strategies for the many-to-one pattern.
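
To make the one-to-one staging pattern concrete, the following is a minimal sketch of how a co-located producer (simulation) and consumer (AI trainer) might exchange snapshots through a node-local directory while timing the transfer. It is not SimAI-Bench code and does not use its API; the function names (`stage_write`, `stage_read`, `benchmark`), the file naming scheme, and the payload sizes are all hypothetical. Reads issued right after writes will usually be served from the OS page cache, so the numbers it reports are illustrative only.

```python
import os
import tempfile
import time


def stage_write(directory, step, payload):
    """Simulation side (hypothetical): write one snapshot, then rename it.

    Writing to a temporary name and renaming means the reader never
    observes a partially written file.
    """
    tmp_path = os.path.join(directory, f"step_{step}.tmp")
    final_path = os.path.join(directory, f"step_{step}.bin")
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)
    return final_path


def stage_read(directory, step):
    """AI-training side (hypothetical): read one staged snapshot."""
    with open(os.path.join(directory, f"step_{step}.bin"), "rb") as f:
        return f.read()


def benchmark(directory, n_steps=4, n_bytes=1 << 20):
    """Time n_steps writes followed by n_steps reads of n_bytes each."""
    payload = os.urandom(n_bytes)

    t0 = time.perf_counter()
    for step in range(n_steps):
        stage_write(directory, step, payload)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    bytes_moved = 0
    for step in range(n_steps):
        bytes_moved += len(stage_read(directory, step))
    read_s = time.perf_counter() - t0

    mib = bytes_moved / (1 << 20)
    return {
        "bytes_moved": bytes_moved,
        "write_MiB_per_s": mib / max(write_s, 1e-9),
        "read_MiB_per_s": mib / max(read_s, 1e-9),
    }


if __name__ == "__main__":
    # A temporary directory stands in for a node-local staging area.
    with tempfile.TemporaryDirectory() as staging_dir:
        print(benchmark(staging_dir))
```

Pointing `directory` at a different mount (for example a parallel file system path) would let the same loop compare backends, under the assumption that both are exposed as ordinary POSIX paths. In-memory stores such as Redis or DragonHPC would need their own client calls in place of the file operations.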

Problem

Research questions and friction points this paper is trying to address.

Analyzing data transport strategies for coupled AI-simulation workflows

Benchmarking performance of one-to-one and many-to-one workflow patterns

Identifying optimal data transport solutions for HPC environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

SimAI-Bench tool for prototyping and evaluating workflows

Node-local and DragonHPC staging for one-to-one pattern

File system as optimal solution for many-to-one pattern
