🤖 AI Summary
Accurately estimating resource requirements for scientific workflows remains challenging due to diverse analytical scenarios, varying user expertise, and highly heterogeneous computing platforms. To address this, we propose an end-to-end machine learning framework that learns the CPU, memory, and runtime requirements of each workflow step directly from historical task execution profiles, removing the reliance on domain-specific heuristics and on time-consuming two-stage trial runs. Integrated into the PanDA workflow management system, the framework enables proactive, fine-grained, dynamic resource pre-allocation. Experimental evaluation in large-scale scientific computing environments, including the Large Hadron Collider (LHC), shows that the approach outperforms baseline methods: average resource waste is reduced by 32%, scheduling latency decreases by 27%, and both heterogeneous resource utilization and workflow execution stability improve.
📝 Abstract
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and each step needs a precise specification of its resource requirements so that it can be allocated optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges is to first process a subset of each step, measure its resource utilization from the actual processing profiles, and only then process the rest of the step. While this two-stage approach lets most of the workflow run on optimal resources, it has drawbacks: the initial subset still runs with inaccurate estimates, which can lead to failures and suboptimal resource usage, and waiting for the initial processing to complete adds overhead that is especially costly for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models predict key resource requirements, overcoming the challenges posed by limited upfront knowledge of the characteristics of each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
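As a rough illustration of what predicting resource requirements from historical profiles can look like, the sketch below estimates memory and wall-clock time for a new task by averaging the measurements of its most similar past tasks. It is not the PanDA pipeline or the models used in the study: the k-nearest-neighbour averaging is a deliberately simple stand-in, and `TaskProfile`, its fields, `predict`, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TaskProfile:
    """One historical execution record (all fields are hypothetical)."""
    n_events: float    # number of events processed
    input_gb: float    # input size in GB
    memory_mb: float   # measured peak memory
    walltime_s: float  # measured wall-clock time


def predict(history, n_events, input_gb, k=3):
    """Average memory and walltime of the k most similar past tasks."""
    if not history:
        raise ValueError("no historical profiles to learn from")

    # Scale each feature by its maximum so neither dominates the distance.
    ev_max = max(p.n_events for p in history) or 1.0
    gb_max = max(p.input_gb for p in history) or 1.0

    def distance(p):
        return sqrt(((p.n_events - n_events) / ev_max) ** 2
                    + ((p.input_gb - input_gb) / gb_max) ** 2)

    nearest = sorted(history, key=distance)[:k]
    memory = sum(p.memory_mb for p in nearest) / len(nearest)
    walltime = sum(p.walltime_s for p in nearest) / len(nearest)
    return memory, walltime


# Made-up history: three small tasks and two large ones.
history = [
    TaskProfile(1_000, 1.0, 1_500, 600),
    TaskProfile(1_200, 1.2, 1_600, 700),
    TaskProfile(900, 0.9, 1_400, 500),
    TaskProfile(10_000, 10.0, 4_000, 6_000),
    TaskProfile(11_000, 11.0, 4_200, 6_600),
]

# Estimate requirements for a new small task before any of it has run.
memory_mb, walltime_s = predict(history, n_events=1_050, input_gb=1.05, k=3)
print(f"memory ~ {memory_mb:.0f} MB, walltime ~ {walltime_s:.0f} s")
```

Unlike the two-stage approach, the estimate here is available before any part of the step has been processed; the trade-off is that its quality depends entirely on how well the history covers tasks like the new one.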