Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

career value

234K/year

🤖 AI Summary

Optimizing both node count and tile size jointly for large-scale parallel coupled-cluster (CCSD) computations remains challenging, particularly in balancing minimal execution time against minimal node-hours. Method: We propose a machine learning–based resource prediction framework that unifies optimization for these two objectives. Leveraging gradient-boosted regression models integrated with active learning, the framework achieves high-fidelity modeling with only ~450 experiments on the Frontier and Aurora exascale systems, and supports cross-platform parameter transfer to reduce data acquisition overhead. Contribution/Results: The framework achieves mean absolute percentage errors (MAPE) of 0.023 on Frontier and 0.073 on Aurora for iterative time prediction, demonstrating high accuracy and strong generalization. To our knowledge, this is the first work enabling joint time–resource optimization for CCSD calculations, delivering deployable, intelligent scheduling support for quantum chemistry simulations.

Technology Category

Application Category

📝 Abstract

In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.

Problem

Research questions and friction points this paper is trying to address.

Predict computational resource costs for parallel chemistry computations

Determine optimal runtime parameters like node count and tile sizes

Address shortest-time and cheapest-run configurations for supercomputer users

Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning predicts computational costs for chemistry simulations

Active learning reduces data collection needs for resource estimation

Gradient Boosting models optimize supercomputer runtime parameters

🔎 Similar Papers

No similar papers found.

Genentech

New York City, New York, United States of America / South San Francisco, California, United States of America

Applied Scientist, Amazon Ads, Demand Forecasting & Guidance

Amazon

171,600.00 - 222,200.00 USD annually

Palo Alto, CA, USA

Authors to Follow