From Human Labels to Literature: Semi-Supervised Learning of NMR Chemical Shifts at Scale

📅 2026-01-26
🤖 AI Summary
This work addresses the cost and limited scalability of machine learning models for atom-level NMR chemical shift prediction, which depend on small, manually assigned datasets. The authors propose a semi-supervised framework that trains on millions of unassigned NMR spectra extracted from the scientific literature together with a small set of atom-assigned examples. The key ideas are treating each literature spectrum as permutation-invariant set supervision, showing that optimal bipartite matching between predicted and observed shifts reduces to a sorting-based loss under commonly satisfied conditions on the loss function, and incorporating solvent information at scale. The authors report substantially better accuracy and robustness than state-of-the-art models, and stronger generalization on larger and more diverse molecular datasets.

📝 Abstract
Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on limited, labor-intensive atom-assigned datasets. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments, integrating a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem, and show that under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, enabling stable large-scale semi-supervised training beyond traditional curated datasets. Our models achieve substantially improved accuracy and robustness over state-of-the-art methods and exhibit stronger generalization on significantly larger and more diverse molecular datasets. Moreover, by incorporating solvent information at scale, our approach captures systematic solvent effects across common NMR solvents for the first time. Overall, our results demonstrate that large-scale unlabeled spectra mined from the literature can serve as a practical and effective data source for training NMR shift models, suggesting a broader role of literature-derived, weakly structured data in data-centric AI for science.
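
The claim that optimal bipartite matching reduces to a sorting-based loss can be checked on a toy case. The sketch below is not the authors' implementation: it assumes a squared-error loss (one loss for which the reduction is known to hold; the abstract does not spell out its exact conditions), assumes the predicted and observed shift lists have equal length, and uses made-up 13C shift values. The function names are hypothetical.

```python
from itertools import permutations


def squared_error(pred, obs):
    """Per-pair loss; assumed here, convex in the difference pred - obs."""
    return (pred - obs) ** 2


def brute_force_matching_loss(predicted, observed, loss=squared_error):
    """Minimum total loss over every one-to-one assignment (factorial cost)."""
    return min(
        sum(loss(p, o) for p, o in zip(predicted, perm))
        for perm in permutations(observed)
    )


def sorted_matching_loss(predicted, observed, loss=squared_error):
    """Pair the i-th smallest prediction with the i-th smallest observed peak."""
    return sum(loss(p, o) for p, o in zip(sorted(predicted), sorted(observed)))


# Hypothetical 13C shifts in ppm; the observed list carries no atom assignment.
predicted = [128.4, 21.3, 170.9, 60.2, 14.5]
observed = [14.1, 171.2, 20.9, 60.5, 127.8]

bf = brute_force_matching_loss(predicted, observed)
st = sorted_matching_loss(predicted, observed)
print(bf, st)  # the two totals agree up to floating-point rounding
```

Sorting costs O(n log n) per molecule and avoids running an explicit assignment solver inside the training loop, which is what makes this form of set supervision practical at the scale of millions of spectra.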
Problem

Research questions and friction points this paper is trying to address.

NMR chemical shifts
semi-supervised learning
literature-extracted spectra
atom-level assignments
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning
NMR chemical shifts
permutation-invariant supervision
literature-mined spectra
solvent effects
Yongqi Jin
School of Mathematical Sciences, Peking University, Beijing, China; AI for Science Institute, Beijing, China; Center for Machine Learning Research, Peking University, Beijing, China
Yecheng Wang
School of Mathematical Sciences, Peking University, Beijing, China
Jun-jie Wang
DP Technology, Beijing, China; College of Chemistry and Molecular Engineering, Peking University, Beijing, China
Rong Zhu
Peking University
Chemistry
Guolin Ke
DP Technology
Machine Learning, AI for Science
Weinan E
School of Mathematical Sciences, Peking University, Beijing, China; AI for Science Institute, Beijing, China; Center for Machine Learning Research, Peking University, Beijing, China