🤖 AI Summary
Optimizing both node count and tile size jointly for large-scale parallel coupled-cluster (CCSD) computations remains challenging, particularly in balancing minimal execution time against minimal node-hours. Method: We propose a machine learning–based resource prediction framework that unifies optimization for these two objectives. Leveraging gradient-boosted regression models integrated with active learning, the framework achieves high-fidelity modeling with only ~450 experiments on the Frontier and Aurora exascale systems, and supports cross-platform parameter transfer to reduce data acquisition overhead. Contribution/Results: The framework achieves mean absolute percentage errors (MAPE) of 0.023 on Frontier and 0.073 on Aurora for iterative time prediction, demonstrating high accuracy and strong generalization. To our knowledge, this is the first work enabling joint time–resource optimization for CCSD calculations, delivering deployable, intelligent scheduling support for quantum chemistry simulations.
📝 Abstract
In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size.
We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.