🤖 AI Summary
This work addresses the limited scalability and high cost of existing machine learning approaches for predicting atom-level NMR chemical shifts, which depend on small, manually atom-assigned datasets. The authors propose a semi-supervised framework that trains on millions of weakly structured, unassigned NMR spectra extracted from the scientific literature, combined with a small set of labeled examples. Key contributions include a permutation-invariant set supervision formulation, a result showing that optimal bipartite matching reduces to a sorting-based loss under commonly satisfied conditions on the loss function, and explicit modeling of solvent effects. The authors report substantially better accuracy, robustness, and generalization than current state-of-the-art models on significantly larger and more diverse molecular datasets.
📝 Abstract
Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on limited, labor-intensive atom-assigned datasets. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments, integrating a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem, and show that under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, enabling stable large-scale semi-supervised training beyond traditional curated datasets. Our models achieve substantially improved accuracy and robustness over state-of-the-art methods and exhibit stronger generalization on significantly larger and more diverse molecular datasets. Moreover, by incorporating solvent information at scale, our approach captures systematic solvent effects across common NMR solvents for the first time. Overall, our results demonstrate that large-scale unlabeled spectra mined from the literature can serve as a practical and effective data source for training NMR shift models, suggesting a broader role of literature-derived, weakly structured data in data-centric AI for science.
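The reduction from bipartite matching to sorting can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names `matching_loss` and `sorted_loss`, the example shift values, and the choice of squared error are all assumptions made for illustration. Squared error is used because, for one-dimensional values and a convex cost of the difference, pairing the sorted predictions with the sorted observations is known to be an optimal assignment; the abstract does not state which conditions the paper actually requires. The sketch also assumes the predicted and observed sets have the same size.

```python
from itertools import permutations


def matching_loss(pred, obs):
    """Brute-force optimal bipartite matching under squared error.

    Tries every assignment of observed peaks to predicted shifts and
    returns the smallest total cost. Factorial time, for checking only.
    """
    return min(
        sum((p - o) ** 2 for p, o in zip(pred, perm))
        for perm in permutations(obs)
    )


def sorted_loss(pred, obs):
    """Sorting-based loss: pair the k-th smallest prediction with the
    k-th smallest observation, then sum squared errors.
    """
    return sum((p - o) ** 2 for p, o in zip(sorted(pred), sorted(obs)))


# Hypothetical example: per-atom predictions (ppm) and an unassigned
# peak list in arbitrary order, as it might appear in a paper's SI.
pred = [7.31, 2.05, 3.88, 1.10]
obs = [1.25, 7.26, 2.17, 3.70]

print(matching_loss(pred, obs), sorted_loss(pred, obs))
```

Under these assumptions the two printed values agree up to floating-point rounding, while the sorted version costs only a sort per molecule instead of a search over assignments. In a differentiable framework the same idea would amount to sorting the predictions and computing an element-wise loss against the sorted peak list.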