Active Learning for Machine Learning Driven Molecular Dynamics

📅 2025-09-21
🤖 AI Summary
Problem: Machine-learned coarse-grained molecular dynamics (CG-MD) potentials degrade and generalize poorly when simulations reach undersampled conformational regions. Method: This paper proposes an active learning framework that combines an RMSD-based frame selection strategy with the CGSchNet neural network potential. It identifies gaps in conformational-space coverage and queries an all-atom simulation "oracle" to generate training data on the fly. Contribution/Results: Unlike static training, the framework corrects the model at the identified coverage gaps while preserving CG-level efficiency. For a CGSchNet model trained on the Chignolin protein, the authors report a 33.05% improvement in the Wasserstein-1 (W1) metric in TICA space on an in-house benchmark suite, indicating better exploration and modeling of previously unseen conformations. The approach offers an adaptive alternative to static training for CG-MD simulation.
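The simulate, select, query, retrain cycle described in the summary can be sketched as a generic loop. This is a minimal illustration, not the authors' implementation: the function `active_learning_loop` and the callables `simulate`, `select`, `oracle`, and `retrain` are hypothetical names, and the toy one-dimensional "model" below only stands in for a CG potential.

```python
from typing import Any, Callable, List, Sequence, Tuple


def active_learning_loop(
    model: Any,
    train_set: Sequence[Any],
    simulate: Callable[[Any], Sequence[Any]],
    select: Callable[[Sequence[Any], List[Any]], Sequence[Any]],
    oracle: Callable[[Any], Any],
    retrain: Callable[[Any, List[Any]], Any],
    n_rounds: int,
) -> Tuple[Any, List[Any]]:
    """Generic active learning cycle (all callables are hypothetical stand-ins)."""
    data = list(train_set)
    for _ in range(n_rounds):
        trajectory = simulate(model)        # run the cheap CG model
        queried = select(trajectory, data)  # pick frames in poorly covered regions
        data.extend(oracle(frame) for frame in queried)  # expensive reference labels
        model = retrain(model, data)        # update the model on the enlarged set
    return model, data


# Toy usage: the "model" is a single number and frames are scalars.
def toy_simulate(m: float) -> List[float]:
    return [m + 1.0, m + 2.0, m + 3.0]


def toy_select(traj: Sequence[float], data: List[float]) -> List[float]:
    # Choose the frame farthest from every training point.
    return [max(traj, key=lambda x: min(abs(x - d) for d in data))]


def toy_oracle(x: float) -> float:
    return x


def toy_retrain(m: float, data: List[float]) -> float:
    return sum(data) / len(data)


final_model, final_data = active_learning_loop(
    0.0, [0.0], toy_simulate, toy_select, toy_oracle, toy_retrain, n_rounds=2
)
```

Passing the four steps in as callables keeps the loop independent of any particular simulator or network; in a real setting the oracle call would be the dominant cost, which is why the selection step matters.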

📝 Abstract
Machine-learned coarse-grained (CG) potentials are fast, but degrade over time when simulations reach undersampled biomolecular conformations, and generating widespread all-atom (AA) data to combat this is computationally infeasible. We propose a novel active learning framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root-mean-square deviation (RMSD) based frame selection from MD simulations in order to generate data on the fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. By training CGSchNet, a coarse-grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein-1 (W1) metric in Time-lagged Independent Component Analysis (TICA) space on an in-house benchmark suite.
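To make the W1 metric concrete, the sketch below computes the Wasserstein-1 distance between two sets of samples along a single coordinate, such as one TICA component. It is an illustration under stated assumptions: the abstract does not say how the TICA projection is built or how components are combined, and reading "improvement" as a relative reduction in W1 is also an assumption. The names `wasserstein_1d` and `relative_improvement` are hypothetical, and the numbers in the usage lines are made up for illustration, not taken from the paper.

```python
import numpy as np


def wasserstein_1d(u, v) -> float:
    """Empirical W1 distance between two 1D samples (area between their CDFs)."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    all_values = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_values)  # widths of the intervals between sample points
    # Empirical CDFs evaluated on each interval's left endpoint.
    u_cdf = np.searchsorted(u, all_values[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, all_values[:-1], side="right") / v.size
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))


def relative_improvement(w1_baseline: float, w1_new: float) -> float:
    """Percent reduction in W1 relative to a baseline (assumed definition)."""
    return 100.0 * (w1_baseline - w1_new) / w1_baseline


# Illustrative usage with synthetic projections (not data from the paper).
reference = [0.0, 1.0, 2.0]
shifted = [1.0, 2.0, 3.0]
w1 = wasserstein_1d(reference, shifted)
gain = relative_improvement(2.0, 1.5)
```

A lower W1 means the distribution sampled by the CG model lies closer to the reference distribution along that coordinate.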
Problem

Research questions and friction points this paper is trying to address.

Addressing the degradation of machine-learned coarse-grained potentials in undersampled biomolecular conformations
Developing an active learning framework to generate training data efficiently during molecular dynamics simulations
Improving neural network potential accuracy by targeting unexplored regions of conformational space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active learning framework for CG neural network potentials
RMSD-based frame selection from MD simulations
On-the-fly data generation by querying an oracle during training
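One way RMSD-based frame selection could work is sketched below. The selection rule shown here (rank simulated frames by their minimum aligned RMSD to any training frame and take the largest) is an assumption made for illustration; the summary does not state the exact criterion the authors use. The names `kabsch_rmsd` and `select_frames` are hypothetical.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (n_beads, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)  # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q             # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # rotation mapping P onto Q
    diff = P @ R.T - Q
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


def select_frames(sim_frames: np.ndarray, train_frames: np.ndarray, n_select: int):
    """Return indices and scores of the simulated frames least covered by training data.

    Coverage score (assumed): minimum aligned RMSD to any training frame.
    """
    min_rmsd = np.array(
        [min(kabsch_rmsd(f, t) for t in train_frames) for f in sim_frames]
    )
    order = np.argsort(min_rmsd)[::-1][:n_select]  # largest gaps first
    return order, min_rmsd[order]


# Illustrative usage with synthetic coordinates (10 beads, not real protein data).
rng = np.random.default_rng(0)
ref = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
rotated = ref @ Rz.T + 5.0  # same shape, rigidly moved
perturbed = ref + rng.normal(scale=0.5, size=ref.shape)  # genuinely different
picked, scores = select_frames(np.stack([rotated, perturbed]), np.stack([ref]), 1)
```

The rigidly moved copy scores near zero because the alignment removes rotation and translation, so only the perturbed frame is flagged as a coverage gap. The pairwise loop costs one alignment per simulated-training frame pair, which is fine for a sketch but would need batching or subsampling for long trajectories.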
Kevin Bachelor
Baskin Engineering, University of California - Santa Cruz, Santa Cruz, CA
Sanya Murdeshwar
Baskin Engineering, University of California - Santa Cruz, Santa Cruz, CA
Daniel Sabo
Baskin Engineering, University of California - Santa Cruz, Santa Cruz, CA
Razvan Marinescu
Assistant Professor, UC Santa Cruz, Computer Science and Engineering, Genomics Institute
Machine Learning, Differentiable Simulators, Bayesian Modeling, Medical Image Analysis, MRI