Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads

2025-09-25
AI Summary
Addressing the challenges of concurrent scheduling and low resource utilization for hybrid HPC and machine learning tasks in scientific workflows, this paper proposes a hierarchical, cooperative runtime integration framework. It achieves the first deep coupling of RADICAL-Pilot, Flux, and Dragon, enabling high-throughput, dynamic scheduling of heterogeneous workloads. Evaluated on the Frontier exascale supercomputer, the integrated RP+Flux system sustains up to 930 tasks/s, while RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% resource utilization. In a real-world drug discovery workflow (IMPECCABLE.v2), makespan is reduced by 30–60% and throughput increases by more than 4×. The core innovation lies in cross-layer resource abstraction and co-optimized scheduling for ultra-large-scale task graphs, significantly outperforming conventional Slurm/srun-based approaches in both scalability and efficiency.

๐Ÿ“ Abstract
Scientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm's srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL-Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high-throughput function execution. Using synthetic and production-scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For the IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30-60% relative to srun/Slurm and increases throughput more than four times on up to 1,024. These results demonstrate that hybrid runtime integration in RP is a scalable approach for hybrid AI-HPC workloads.
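To put the reported figures side by side, the following is a minimal sketch that derives relative throughput and the speedup implied by a makespan reduction. It uses only the numbers quoted in the abstract; the function names are hypothetical, and the ratios compare peak values that may have been measured at different scales, so they are indicative rather than like-for-like.

```python
# Throughput figures (tasks/s) as quoted in the abstract.
REPORTED_TASKS_PER_SEC = {
    "srun": 152.0,             # reported peak
    "RP+Flux": 930.0,          # reported as "up to"
    "RP+Flux+Dragon": 1500.0,  # reported as a lower bound ("exceeds")
}


def ratio_vs_baseline(rates, baseline="srun"):
    """Return each configuration's throughput divided by the baseline's."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items() if name != baseline}


def makespan_speedup(reduction):
    """Convert a fractional makespan reduction (0.30 = 30%) to a speedup factor.

    If the new makespan is (1 - reduction) times the old one,
    the speedup is old / new = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for name, ratio in ratio_vs_baseline(REPORTED_TASKS_PER_SEC).items():
        print(f"{name}: {ratio:.1f}x the srun peak")
    for reduction in (0.30, 0.60):
        print(f"{reduction:.0%} shorter makespan -> {makespan_speedup(reduction):.2f}x speedup")
```

On these figures, RP+Flux reaches roughly 6 times the srun peak and RP+Flux+Dragon roughly 10 times, while a 30-60% makespan reduction corresponds to a speedup of about 1.4 to 2.5.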
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of Slurm's srun for dynamic heterogeneous workloads
Integrating RADICAL-Pilot with Flux and Dragon runtime systems
Improving task throughput and resource utilization for AI-HPC workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated RADICAL-Pilot with Flux and Dragon runtimes
Enabled hierarchical resource management for hybrid workloads
Achieved high-throughput task execution exceeding 1500 tasks/s
Andre Merzky
Rutgers University
Mikhail Titov
Brookhaven National Laboratory
Matteo Turilli
Rutgers University – New Brunswick, IE University
Shantenu Jha
Rutgers University and Brookhaven National Laboratory
High-performance and Distributed Computing, Cyberinfrastructure, Computational Science