Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Optimizing node count and tile size jointly for large-scale parallel coupled-cluster (CCSD) computations remains challenging, particularly when balancing minimal execution time against minimal node-hours. Method: We propose a machine learning–based resource prediction framework that unifies optimization for these two objectives. Leveraging gradient-boosted regression models integrated with active learning, the framework reaches a MAPE of about 0.2 with only ~450 experiments collected on the Frontier and Aurora exascale systems, and supports cross-platform parameter transfer to reduce data acquisition overhead. Contribution/Results: The framework achieves mean absolute percentage errors (MAPE) of 0.023 on Aurora and 0.073 on Frontier for iteration-time prediction, demonstrating high accuracy and strong generalization. To our knowledge, this is the first work enabling joint time–resource optimization for CCSD calculations, delivering deployable, intelligent scheduling support for quantum chemistry simulations.

📝 Abstract
In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.
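The abstract's core pipeline, predicting iteration time from runtime parameters with a Gradient Boosting model and scoring it with MAPE, can be sketched as follows. This is an illustrative sketch only: the feature names, synthetic cost model, and data ranges are assumptions, not the paper's actual dataset or setup.

```python
# Hypothetical sketch: fit a Gradient Boosting regressor to predict CCSD
# iteration time from runtime parameters (nodes, tile size, problem size),
# then score it with MAPE as in the paper. All data here is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
nodes = rng.integers(8, 513, size=n)      # number of nodes
tile = rng.integers(20, 121, size=n)      # tile size
basis = rng.integers(200, 1201, size=n)   # problem size (e.g., basis functions)

# Synthetic stand-in for measured iteration time: a compute term that grows
# with problem size and shrinks with node count, plus a tile-dependent
# overhead and noise (purely illustrative, not the paper's cost model).
time_s = 1e-4 * basis.astype(float) ** 2 / nodes + 200.0 / tile \
    + rng.normal(0.0, 0.2, size=n)

X = np.column_stack([nodes, tile, basis])
X_tr, X_te, y_tr, y_te = train_test_split(X, time_s, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
mape = mean_absolute_percentage_error(y_te, model.predict(X_te))
print(f"test MAPE: {mape:.3f}")
```

On real measurements, the same fit/score loop would be run per machine (Frontier, Aurora), since the cost surfaces differ between systems.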
Problem

Research questions and friction points this paper is trying to address.

Predict computational resource costs for parallel chemistry computations
Determine optimal runtime parameters like node count and tile sizes
Address shortest-time and cheapest-run configurations for supercomputer users
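Once a time predictor exists, the two user questions above reduce to scanning candidate configurations under different objectives. The sketch below uses a toy analytic stand-in for the trained predictor; the cost function and candidate grid are assumptions for illustration.

```python
# Hypothetical sketch of the two user questions: given a time predictor
# (here a toy analytic stand-in, not the paper's trained model), scan the
# (nodes, tile size) grid for (a) shortest time and (b) fewest node-hours.
def predict_time_s(nodes, tile):
    # Toy cost model: a parallel compute term plus tile-dependent overheads.
    return 5000.0 / nodes + tile / 10.0 + 100.0 / tile

candidates = [(n, t) for n in (16, 32, 64, 128, 256) for t in (20, 40, 60, 80)]

# Shortest-time question: minimize predicted wall-clock time.
fastest = min(candidates, key=lambda c: predict_time_s(*c))

# Cheapest-run question: minimize node-hours = nodes * hours.
cheapest = min(candidates, key=lambda c: c[0] * predict_time_s(*c) / 3600.0)

print("fastest config (nodes, tile):", fastest)   # favors many nodes
print("cheapest config (nodes, tile):", cheapest)  # favors few nodes
```

The two answers typically differ: adding nodes keeps shrinking wall-clock time past the point where the extra nodes stop paying for themselves in node-hours.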
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning predicts computational costs for chemistry simulations
Active learning reduces data collection needs for resource estimation
Gradient Boosting models optimize supercomputer runtime parameters
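The active-learning idea, spending scarce experiment budget only on the configurations the model is least sure about, can be sketched as a pool-based query-by-committee loop. The oracle, pool grid, committee size, and budget below are all illustrative assumptions, not the paper's procedure.

```python
# Hypothetical sketch of pool-based active learning for cost modeling:
# repeatedly label the configuration where a small committee of Gradient
# Boosting models (trained on bootstrap resamples) disagrees most. The
# oracle below is a toy stand-in for running a real experiment.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def run_experiment(x):  # toy oracle, not real supercomputer measurements
    nodes, tile = x
    return 5000.0 / nodes + tile / 10.0 + 100.0 / tile

pool = np.array([(n, t) for n in range(8, 257, 8)
                 for t in range(20, 121, 10)], dtype=float)
labeled_idx = list(rng.choice(len(pool), size=10, replace=False))

for _ in range(20):  # actively label 20 more configurations
    X = pool[labeled_idx]
    y = np.array([run_experiment(x) for x in X])
    preds = []
    for seed in range(5):  # committee on bootstrap resamples of the labels
        idx = rng.integers(0, len(X), len(X))
        m = GradientBoostingRegressor(n_estimators=50, random_state=seed)
        preds.append(m.fit(X[idx], y[idx]).predict(pool))
    disagreement = np.std(preds, axis=0)
    disagreement[labeled_idx] = -1.0  # never re-query labeled points
    labeled_idx.append(int(np.argmax(disagreement)))

print("experiments run:", len(labeled_idx))
```

The stopping criterion here is a fixed budget; in practice one would stop once held-out MAPE plateaus, which is how a target like ~0.2 MAPE at ~450 experiments would be reached.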
Tanzila Tabassum
Louisiana State University, Louisiana, USA
Omer Subasi
Pacific Northwest National Laboratory, Washington, USA
Ajay Panyala
Pacific Northwest National Laboratory, Washington, USA
Epiya Ebiapia
Louisiana State University, Louisiana, USA
Gerald Baumgartner
Louisiana State University, Louisiana, USA
Erdal Mutlu
Pacific Northwest National Laboratory, Washington, USA
P. (Saday) Sadayappan
University of Utah, Utah, USA
Karol Kowalski
Pacific Northwest National Laboratory
many-body theories