Machine Learning-Driven Predictive Resource Management in Complex Science Workflows

📅 2025-09-14
🤖 AI Summary
Accurately estimating resource requirements for scientific workflows remains challenging due to diverse analytical scenarios, varying user expertise, and highly heterogeneous computing platforms. To address this, we propose an end-to-end machine learning framework that directly learns CPU, memory, and runtime requirements for each workflow step from historical task execution profiles—eliminating reliance on domain-specific heuristics or time-consuming two-stage trial runs. Integrated into the PanDA workflow management system, our framework enables proactive, fine-grained dynamic resource pre-allocation. Experimental evaluation in large-scale scientific computing environments—including the Large Hadron Collider (LHC)—demonstrates that our approach significantly outperforms baseline methods: average resource waste is reduced by 32%, scheduling latency decreases by 27%, and heterogeneous resource utilization and workflow execution stability are substantially improved.

📝 Abstract
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and the precise specification of resource requirements is crucial for each step to allocate optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges involves initially processing a subset of each step to measure precise resource utilization from actual processing profiles before completing the entire step. While this two-staged approach enables processing on optimal resources for most of the workflow, it has drawbacks such as initial inaccuracies leading to potential failures and suboptimal resource usage, along with overhead from waiting for initial processing completion, which is critical for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models employ advanced machine learning techniques to predict key resource requirements, overcoming challenges posed by limited upfront knowledge of characteristics at each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
Problem

Research questions and friction points this paper is trying to address.

Predicting resource needs for complex scientific workflows
Overcoming inaccurate initial resource estimation challenges
Enabling optimal resource allocation using machine learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning predicts resource needs
Overcomes limited upfront knowledge challenges
Enables proactive workflow management decisions
Tasnuva Chowdhury
The European Organization for Nuclear Research (CERN), Geneva and Brookhaven National Laboratory
Particle Physics
Tadashi Maeno
Brookhaven National Laboratory, Upton, NY, USA.
Fatih Furkan Akman
University of Massachusetts, Amherst, MA, USA.
Joseph Boudreau
University of Pittsburgh, Pittsburgh, PA, USA.
Sankha Dutta
Brookhaven National Laboratory, Upton, NY, USA.
Shengyu Feng
Carnegie Mellon University
Combinatorial Optimization, Language Models
Adolfy Hoisie
Department Chair, Brookhaven National Laboratory
Computer science, computer architecture, modeling and simulation
Kuan-Chieh Hsu
Brookhaven National Laboratory, Upton, NY, USA.
Raees Khan
University of Pittsburgh, Pittsburgh, PA, USA.
Jaehyung Kim
Carnegie Mellon University, Pittsburgh, PA, USA.
Ozgur O. Kilic
Brookhaven National Laboratory, Upton, NY, USA.
Scott Klasky
Oak Ridge National Laboratory
Computer Science, Physics, High Performance Computing, data science
Alexei Klimentov
Brookhaven National Laboratory, Upton, NY, USA.
Tatiana Korchuganova
University of Pittsburgh, Pittsburgh, PA, USA.
Verena Ingrid Martinez Outschoorn
University of Massachusetts, Amherst, MA, USA.
Paul Nilsson
Brookhaven National Laboratory, Upton, NY, USA.
David K. Park
Brookhaven National Laboratory, Upton, NY, USA.
Norbert Podhorszki
Workflow Systems Group, Oak Ridge National Laboratory
Parallel I/O, in situ, Scientific Workflows HPC, Big Data
Yihui Ren
Brookhaven National Laboratory
artificial intelligence, physics, network science, computer science
John Rembrandt Steele
University of Massachusetts, Amherst, MA, USA.
Frédéric Suter
Oak Ridge National Laboratory, IEEE Senior member
Computer Science, Workflow, Scheduling, Simulation
Sairam Sri Vatsavai
Research Associate, Brookhaven National Lab
AI Accelerator Modelling, Silicon Photonics, Reservoir Computing, Photonic Network on Chip
Torre Wenaus
Brookhaven National Laboratory, Upton, NY, USA.
Wei Yang
SLAC National Accelerator Laboratory, Menlo Park, CA, USA.
Yiming Yang
Carnegie Mellon University, Pittsburgh, PA, USA.