Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

📅 2025-02-28
🤖 AI Summary
End-to-end heterogeneous pipelines for scientific AI suffer from fragmented data preprocessing, distributed training, and post-processing stages, as well as poor cross-platform portability. This paper introduces Deep RC, a heterogeneous runtime system that tightly integrates the Cylon data-engineering framework with the Radical Pilot task scheduler, providing unified support for end-to-end neural forecasting and hydrological modeling across cloud and HPC environments. The authors propose the first communication abstraction unifying MPI, GLOO, and NCCL, enabling accelerator-agnostic coordination across the preprocessing, training, and post-processing phases. Evaluation shows end-to-end latency reductions of 3.28 seconds (neural forecasting) and 75.9 seconds (hydrological modeling) versus baseline systems, significantly accelerating scientific workflows. The open-source framework provides scalable, flexible infrastructure for scientific AI applications in genomics, climate modeling, and astronomy.
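The summary's "accelerator-agnostic coordination" can be pictured as choosing a collective-communication backend per pipeline stage. The sketch below is illustrative only, not Deep RC's actual API: it picks NCCL on GPU nodes, MPI when an MPI launcher's environment variables are present, and GLOO as a portable CPU fallback; the function name and environment-variable checks are assumptions.

```python
def select_backend(has_gpu: bool, env: dict) -> str:
    """Pick a collective-communication backend for one pipeline stage.

    Illustrative logic only; Deep RC's real selection mechanism may differ.
    """
    if has_gpu:
        return "nccl"   # GPU-to-GPU collectives
    if "OMPI_COMM_WORLD_SIZE" in env or "PMI_SIZE" in env:
        return "mpi"    # stage was launched under an MPI runtime
    return "gloo"       # portable CPU-only fallback


# A CPU-only stage launched by mpirun would coordinate over MPI:
print(select_backend(False, {"OMPI_COMM_WORLD_SIZE": "4"}))  # mpi
```

The point of such an abstraction is that preprocessing, training, and post-processing stages can each request "a backend" without hard-coding one library, which is what makes the same pipeline portable across CPU clusters and GPU nodes.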

📝 Abstract
Significant obstacles exist in scientific domains such as genetics, climate modeling, and astronomy when managing, preprocessing, and training deep learning models on complex data. While several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning frameworks, and data frameworks on high-performance computing platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the use of communication libraries such as MPI, GLOO, and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by using Radical Pilot as a task execution framework. Running an end-to-end pipeline of preprocessing, model training, and postprocessing with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system reduces runtime by 3.28 and 75.9 seconds, respectively. The design of Deep RC guarantees smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, exhibiting strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this approach closes the gap between data preprocessing, model training, and postprocessing.
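The preprocessing, training, and postprocessing flow the abstract describes can be sketched as independent tasks handed to an executor, loosely analogous to how a pilot-job system such as Radical Pilot schedules pipeline stages. All function names and the toy computation below are illustrative assumptions, not Deep RC's real interface.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(raw):
    """Toy preprocessing: normalize values to the range [0, 1]."""
    m = max(raw)
    return [x / m for x in raw]


def train(features):
    """Stand-in for model training: return a single summary value."""
    return sum(features) / len(features)


def postprocess(y):
    """Toy postprocessing: round the result for reporting."""
    return round(y, 3)


# Each stage runs as a separately submitted task; a real pilot system
# would place these on cloud or HPC resources instead of local threads.
with ThreadPoolExecutor() as pool:
    feats = pool.submit(preprocess, [2.0, 4.0, 8.0]).result()
    model = pool.submit(train, feats).result()
    out = pool.submit(postprocess, model).result()
print(out)  # 0.583
```

Chaining stages through task submission, rather than one monolithic script, is what lets a runtime place each stage on whichever resource (CPU, GPU, cloud, or supercomputer node) suits it best.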
Problem

Research questions and friction points this paper is trying to address.

Managing complex data for deep learning in scientific domains.
Integrating scalable runtime tools with deep learning frameworks.
Supporting heterogeneous systems and multi-node setups for distributed pipelines.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous runtime system for HPC environments
Integration of data engineering and deep learning frameworks
Parallel and distributed deep learning pipelines
Arup Kumar Sarker
PhD Student, Computer Science, University of Virginia
Deep Learning, HPC, Distributed Computing, Computer Vision, Autonomous Vehicles
Aymen Alsaadi
Ph.D., Rutgers University
Cloud Computing, HPC, Parallel Processing, Workflow Management
Alexander James Halpern
Department of Computer Science, University of Virginia, Charlottesville, VA 22904
Prabhath Tangella
Department of Computer Science, University of Virginia, Charlottesville, VA 22904
Mikhail Titov
Brookhaven National Laboratory, Upton, NY
G. V. Laszewski
Biocomplexity Institute and Initiative, Town Center Four, 994 Research Park Boulevard Charlottesville, VA 22911
Shantenu Jha
Rutgers University and Brookhaven National Laboratory
High-performance and Distributed Computing, Cyberinfrastructure, Computational Science
Geoffrey C. Fox
Department of Computer Science, University of Virginia, Charlottesville, VA 22904; Biocomplexity Institute and Initiative, Town Center Four, 994 Research Park Boulevard Charlottesville, VA 22911