🤖 AI Summary
Atmospheric new particle formation remains a major source of uncertainty in climate modeling. To address this, we propose an efficient and interpretable machine learning method for predicting molecular cluster energies. Our approach is a k-nearest neighbors regression framework that pairs the FCHL19 molecular descriptor with chemically informed distance metrics: a kernel-induced metric and a metric learned via metric learning for kernel regression (MLKR). Applied to the QM9 dataset and to large atmospheric cluster datasets (>250,000 structures), the models achieve near-chemical accuracy (MAE ≈ 1 kcal/mol). They rival kernel ridge regression (KRR) models in accuracy while reducing computational time by orders of magnitude, and they offer intrinsic interpretability and straightforward uncertainty estimation. The models also appear to extrapolate to larger, previously unseen clusters with minimal error. This supports scalable computational investigation of aerosol nucleation mechanisms, which could in turn help improve climate models.
📝 Abstract
Understanding how atmospheric molecular clusters form and grow is key to resolving one of the biggest uncertainties in climate modelling: the formation of new aerosol particles. While quantum chemistry offers accurate insights into these early-stage clusters, its steep computational cost limits large-scale exploration. In this work, we present a fast, interpretable, and surprisingly powerful alternative: a $k$-nearest neighbour ($k$-NN) regression model. By leveraging chemically informed distance metrics, including a kernel-induced metric and one learned via metric learning for kernel regression (MLKR), we show that simple $k$-NN models can rival more complex kernel ridge regression (KRR) models in accuracy, while reducing computational time by orders of magnitude. We perform this comparison with the well-established Faber-Christensen-Huang-Lilienfeld (FCHL19) molecular descriptor, but other descriptors (e.g., FCHL18, MBDF, and CM) can be shown to perform similarly. Applied to both simple organic molecules in the QM9 benchmark set and large datasets of atmospheric molecular clusters (sulphuric acid-water and sulphuric acid-multibase systems), our $k$-NN models achieve near-chemical accuracy, scale seamlessly to datasets with over 250,000 entries, and even appear to extrapolate to larger unseen clusters with minimal error (often nearing 1 kcal/mol). With built-in interpretability and straightforward uncertainty estimation, this work positions $k$-NN as a potent tool for accelerating discovery in atmospheric chemistry and beyond.
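To make the idea of a kernel-induced metric concrete, here is a minimal sketch of $k$-NN regression in which the distance is derived from a kernel through $d(x, y)^2 = k(x, x) + k(y, y) - 2k(x, y)$. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the width `sigma`, the plain feature vectors standing in for FCHL19 descriptors, and the function names are all hypothetical choices. Using the spread of the neighbours' labels as an uncertainty estimate is likewise one simple possibility, assumed here for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_distance(A, B, sigma=1.0):
    """Kernel-induced distance d(x, y) = sqrt(k(x,x) + k(y,y) - 2 k(x,y)).

    For the Gaussian kernel assumed here, k(x, x) = 1, so d^2 = 2 - 2 k(x, y).
    """
    d2 = 2.0 - 2.0 * gaussian_kernel(A, B, sigma)
    return np.sqrt(np.maximum(d2, 0.0))

def knn_predict(X_train, y_train, X_query, k=3, sigma=1.0):
    """Average the labels of the k nearest training points.

    Returns the mean prediction, the neighbours' standard deviation
    (an assumed, simple uncertainty proxy), and the neighbour indices,
    which show exactly which training structures drove each prediction.
    """
    D = kernel_distance(X_query, X_train, sigma)
    idx = np.argsort(D, axis=1)[:, :k]
    neigh = y_train[idx]
    return neigh.mean(axis=1), neigh.std(axis=1), idx

# Toy usage: 2-D vectors stand in for molecular descriptors (assumption).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
y_train = np.array([1.0, 3.0, 10.0])
mean, std, idx = knn_predict(X_train, y_train, np.array([[0.1, 0.0]]), k=2)
print(mean, std, idx)
```

Because each prediction is an average over explicitly identified neighbours, the returned indices give a direct view of why the model predicted what it did, which is the kind of interpretability the abstract refers to. A learned metric such as MLKR would replace `kernel_distance` with a distance computed in a trained linear projection of the descriptor space.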