Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training datasets for machine learning force fields (MLFFs) often lose chemical diversity because common generation techniques oversample some regions of the potential energy surface (PES), and conventional clustering and pruning methods struggle in high-dimensional atomic descriptor spaces. To address this, the authors propose MEAGraph (Multi-kernel Edge Attention-based Graph Autoencoder), an unsupervised graph autoencoder that combines multiple linear kernel transformations with attention-based message passing to learn geometry-sensitive representations of atomic environments without labels or extensive training. On niobium (Nb), tantalum (Ta), and iron (Fe) datasets, MEAGraph efficiently groups similar atomic environments, so that basic pruning techniques suffice to remove sampling bias. The learned representations can also be used for data analysis, outlier detection, and dataset optimization.

📝 Abstract
Constructing a chemically diverse dataset while avoiding sampling bias is critical to training efficient and generalizable force fields. However, in computational chemistry and materials science, many common dataset generation techniques are prone to oversampling regions of the potential energy surface. Furthermore, these regions can be difficult to identify and isolate from each other or may not align well with human intuition, making it challenging to systematically remove bias in the dataset. While traditional clustering and pruning (down-sampling) approaches can be useful for this, they can often lead to information loss or a failure to properly identify distinct regions of the potential energy surface due to difficulties associated with the high dimensionality of atomic descriptors. In this work, we introduce the Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and enable effective dataset pruning without relying on labels or extensive training. Demonstrated applications on niobium, tantalum, and iron datasets show that MEAGraph efficiently groups similar atomic environments, allowing for the use of basic pruning techniques for removing sampling bias. This approach provides an effective method for representation learning and clustering that can be used for data analysis, outlier detection, and dataset optimization.
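The abstract mentions "basic pruning techniques" applied once similar atomic environments have been grouped. A minimal sketch of one such technique, capping the number of samples kept per cluster, is shown below. It is an illustration under stated assumptions, not the authors' procedure: the function name `prune_by_cluster`, the `max_per_cluster` parameter, and the toy labels are all hypothetical, and the cluster labels are assumed to come from some upstream clustering step such as MEAGraph.

```python
import numpy as np

def prune_by_cluster(labels, max_per_cluster, seed=0):
    """Keep at most `max_per_cluster` randomly chosen samples from each cluster.

    `labels` holds one integer cluster label per atomic environment.
    This is a hypothetical helper, not part of MEAGraph itself.
    Returns the sorted indices of the samples that are kept.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Down-sample only the clusters that exceed the cap.
        if members.size > max_per_cluster:
            members = rng.choice(members, size=max_per_cluster, replace=False)
        keep.append(members)
    if not keep:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(keep))

# Toy example: cluster 0 is heavily oversampled relative to clusters 1 and 2.
labels = np.array([0] * 8 + [1] * 3 + [2] * 1)
kept = prune_by_cluster(labels, max_per_cluster=3)
counts = np.bincount(labels[kept])
print(counts)  # per-cluster counts after pruning
```

Capping each cluster leaves rare environments untouched while thinning the oversampled ones, which is one simple way to rebalance a dataset without energy or force labels.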
Problem

Research questions and friction points this paper is trying to address.

Unsupervised method identifies sampling bias in atomic datasets
Multi-kernel graph autoencoder clusters high-dimensional atomic environments
Prunes datasets to remove bias without energy labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised graph autoencoder for atomic data mining
Multi-kernel attention for geometric sensitivity capture
Label-free clustering and pruning of atomic environments
Authors

Hong Sun
LLNL
Machine learning, Materials science
Joshua A. Vita
Materials Science Division, Lawrence Livermore National Laboratory, Livermore, CA 94551
Amit Samanta
Physics Division, Lawrence Livermore National Laboratory, Livermore, CA 94551
Vincenzo Lordi
Lawrence Livermore National Laboratory
Computational Materials Science, Semiconductors, Spectroscopy, Renewable Energy, Quantum Information Science