Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the data-distribution imbalance and the performance bottleneck in symmetric tensor contraction kernels, both caused by the heterogeneous sizes of geometric graphs in chemistry foundation model (CFM) training, this work targets the MACE model and formulates CFM batch construction as a multi-objective bin-packing problem, the first such formulation for CFMs. A load-balanced distributed scheduling method optimizes batch composition across devices. In parallel, MACE's critical symmetric tensor contraction kernels are systematically identified and deeply optimized, combining GNN training acceleration, high-performance tensor computation, and CUDA kernel-level optimizations. Evaluated on 740 GPUs with a 2.6-million-sample dataset, the approach reduces per-epoch runtime from 12 minutes to 2 minutes, a 6× speedup, significantly improving training efficiency and scalability for large-scale CFMs.

📝 Abstract
Chemistry Foundation Models (CFMs) that leverage Graph Neural Networks (GNNs) operating on 3D molecular graph structures are becoming indispensable tools for computational chemists and materials scientists. These models facilitate the understanding of matter and the discovery of new molecules and materials. In contrast to GNNs operating on a single large homogeneous graph, GNNs used by CFMs process a large number of geometric graphs of varying sizes, requiring different optimization strategies than those developed for large homogeneous GNNs. This paper presents optimizations for two critical phases of CFM training: data distribution and model training, targeting MACE, a state-of-the-art CFM. We address the challenge of load balancing in data distribution by formulating it as a multi-objective bin-packing problem. We propose an iterative algorithm that provides a highly effective, fast, and practical solution, ensuring efficient data distribution. For the training phase, we identify symmetric tensor contraction as the key computational kernel in MACE and optimize this kernel to improve overall performance. Our combined approach of balanced data distribution and kernel optimization significantly enhances the training process of MACE. Experimental results demonstrate a substantial speedup, reducing per-epoch training time from 12 to 2 minutes on 740 GPUs with a 2.6M-sample dataset.
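The load-balancing idea above can be illustrated with a minimal greedy sketch. This is a toy under stated assumptions, not the paper's iterative multi-objective algorithm: the `Graph` type and `balance_batches` function are hypothetical, and a graph's cost is approximated by its edge count alone, with a longest-processing-time heuristic assigning each graph to the currently least-loaded device.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Graph:
    """A geometric graph summarized by its size metrics (hypothetical type)."""
    num_nodes: int
    num_edges: int

def balance_batches(graphs: List[Graph], num_devices: int) -> List[List[Graph]]:
    """Greedy longest-processing-time heuristic: place each graph, largest
    first, on the currently least-loaded device. Edge count stands in for
    the cost of message passing / tensor contraction on that graph."""
    bins: List[List[Graph]] = [[] for _ in range(num_devices)]
    loads = [0] * num_devices
    for g in sorted(graphs, key=lambda g: g.num_edges, reverse=True):
        i = loads.index(min(loads))  # least-loaded device so far
        bins[i].append(g)
        loads[i] += g.num_edges
    return bins

# Demo: eight graphs of uneven size spread across two devices.
graphs = [Graph(n, 3 * n) for n in (50, 40, 30, 20, 10, 10, 10, 10)]
bins = balance_batches(graphs, 2)
loads = [sum(g.num_edges for g in b) for b in bins]
```

The paper's actual formulation balances multiple objectives simultaneously (e.g., node and edge counts per device batch); a real implementation would combine those metrics into the load measure rather than using edges alone.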
Problem

Research questions and friction points this paper is trying to address.

Optimizing data distribution for efficient CFM training
Improving kernel performance in MACE model training
Balancing load in heterogeneous molecular graph processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates batch construction as a multi-objective bin-packing problem for balanced data distribution
Optimizes MACE's symmetric tensor contraction kernels at the CUDA level
Achieves a 6× per-epoch speedup in large-scale CFM training
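On the kernel side, a toy quadratic form (not MACE's actual equivariant kernel, which contracts higher-order tensors with generalized Clebsch-Gordan coefficients) shows why symmetry in the coefficient tensor lets a contraction kernel skip redundant work: only the upper triangle is visited, with off-diagonal terms doubled, roughly halving the multiplies.

```python
import numpy as np

def quadratic_form_full(C, x):
    """Naive contraction: y = sum_{i,j} C[i,j] * x[i] * x[j]."""
    return np.einsum("ij,i,j->", C, x, x)

def quadratic_form_sym(C, x):
    """Exploit C[i,j] == C[j,i]: visit only i <= j and double the
    off-diagonal terms, doing roughly half the multiplications."""
    n = len(x)
    total = 0.0
    for i in range(n):
        total += C[i, i] * x[i] * x[i]
        for j in range(i + 1, n):
            total += 2.0 * C[i, j] * x[i] * x[j]
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
C = (A + A.T) / 2.0        # symmetrize the coefficient tensor
x = rng.normal(size=4)
full = quadratic_form_full(C, x)
sym = quadratic_form_sym(C, x)
```

The same principle, applied to MACE's higher-order symmetric products and fused into dedicated CUDA kernels, is what the paper's kernel optimization exploits at scale.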
👥 Authors
Jesun Firoz, Pacific Northwest National Laboratory, USA
Franco Pellegrini, SISSA, Italy
Mario Geiger, MIT
Darren Hsu, NVIDIA, USA
Jenna A. Bilbrey, Pacific Northwest National Laboratory, USA
Han-Yi Chou, NVIDIA, USA
Maximilian Stadler, Technische Universität München
Markus Hoehnerbach, NVIDIA, USA
Tingyu Wang, NVIDIA, USA
Dejun Lin, NVIDIA, USA
Emine Kucukbenli, International School for Advanced Studies (SISSA), Trieste
Henry W. Sprueill, Pacific Northwest National Laboratory, USA
Ilyes Batatia, University of Cambridge, UK
Sotiris S. Xantheas, Pacific Northwest National Laboratory / University of Washington, USA
MalSoon Lee, Pacific Northwest National Laboratory, USA
Chris Mundy, Pacific Northwest National Laboratory, USA
Gabor Csanyi, University of Cambridge, UK
Justin S. Smith, NVIDIA, USA
Ponnuswamy Sadayappan, University of Utah, USA
Sutanay Choudhury, Pacific Northwest National Laboratory, USA