🤖 AI Summary
Automated de novo structural elucidation of organic molecules with up to 40 non-hydrogen atoms (composed of C, N, O, H, P, S, Si, B, and the halogens) solely from 1D $^1$H/$^{13}$C NMR spectra remains a longstanding challenge.
Method: This work introduces a Transformer-based architecture for NMR-driven de novo structure generation that models molecular graphs as learnable sequences, enabling an end-to-end mapping from spectra to structures. By jointly learning spectral representations and graph generation, the approach sidesteps the combinatorial explosion of candidate structures, covers the elements typically encountered in organic chemistry, and can be fine-tuned on experimental data.
Contribution/Results: Evaluated on molecules spanning a large portion of drug-like chemical space, the method ranks the correct molecule within its first 15 predictions in 55.2% of cases (top-15 accuracy), using only the $^1$H and $^{13}$C NMR spectra. It establishes a scalable deep learning approach to NMR-based structural elucidation that covers the common organic elements and can be extended to experimental data via fine-tuning.
📝 Abstract
One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e., de novo structure generation, thus appears completely intractable. Here we show how it is possible to achieve this task for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 55.2% accuracy within the first 15 predictions using only the $^1$H and $^{13}$C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.
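The headline figure is a top-15 accuracy: a prediction counts as correct when the true molecule appears anywhere among the first 15 ranked candidates. The sketch below shows how such a metric is typically computed. It is illustrative only and not taken from the paper: the function name `top_k_accuracy`, the toy data, and the use of plain strings to identify molecules are all assumptions, and in practice candidates and targets would first need to be brought into a common canonical form before comparison.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 15,
) -> float:
    """Fraction of targets found among the first k ranked candidates.

    Hypothetical helper: molecules are compared as plain strings, which
    assumes both sides already use the same canonical representation.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one ranked candidate list per target")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data (made up): three spectra, each with a ranked candidate list.
toy_predictions = [
    ["CCO", "CO", "C"],   # true structure ranked second
    ["CC", "CCC"],        # true structure missing
    ["N", "O"],           # true structure ranked first
]
toy_targets = ["CO", "CCCC", "N"]

top1 = top_k_accuracy(toy_predictions, toy_targets, k=1)
top15 = top_k_accuracy(toy_predictions, toy_targets, k=15)
```

On the toy data, only the third spectrum is solved at rank 1, while the first and third are both solved within the top 15, so the metric can only stay equal or grow as `k` increases.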