🤖 AI Summary
This work investigates whether machine-learned interatomic potentials generalize compositionally, that is, whether they can predict properties of unseen molecules from the compositional rules of chemical structure rather than by memorizing training data. To this end, we introduce the first benchmark for compositional generalization in interatomic potentials. It comprises four tasks with carefully designed train/test splits and is used to evaluate state-of-the-art models, including large-scale pretrained architectures, on out-of-distribution molecules. Our experiments reveal that even models pretrained on millions of molecules often exhibit errors on out-of-distribution samples that are an order of magnitude higher than on in-distribution data, exposing limitations in compositional generalization that remain a significant challenge for current approaches.
📝 Abstract
Machine learning interatomic potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate interatomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks, each of which requires some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.