🤖 AI Summary
Current research on NMR-based molecular structure elucidation relies heavily on synthetic data, which causes severe domain shift when models are applied to real spectra. It also lacks standardized evaluation protocols and rigorous data splitting, which often results in data leakage and unfair comparisons. To address these limitations, this work introduces NMRGym, described as the largest and most comprehensive benchmark to date, built on high-quality experimental NMR data covering 269,999 molecules with high-fidelity ¹H and ¹³C spectra. The dataset features stringent quality control, uniform formatting, a scaffold-aware data partitioning strategy, and the first fine-grained annotations at the atom-to-peak level. The authors further establish a multitask evaluation framework and an open-source automated leaderboard that supports structure elucidation, functional group identification, toxicity prediction, and spectral simulation, with the aim of improving standardization and reproducibility in NMR research.
📝 Abstract
Nuclear Magnetic Resonance (NMR) spectroscopy is the cornerstone of small-molecule structure elucidation. While deep learning has demonstrated considerable potential for automating structure elucidation and spectral simulation, progress is severely impeded by the reliance on synthetic datasets, which introduces significant domain shift when models are applied to real-world experimental spectra. Furthermore, the lack of standardized evaluation protocols and rigorous data splitting strategies frequently leads to unfair comparisons and data leakage. To address these challenges, we introduce \textbf{NMRGym}, the largest and most comprehensive standardized dataset and benchmark derived from high-quality experimental NMR data to date. Comprising \textbf{269,999} unique molecules paired with high-fidelity $^1$H and $^{13}$C spectra, NMRGym bridges the critical gap between synthetic approximations and real-world diversity. We implement a strict quality control pipeline and unify data formats to ensure fair comparison. To prevent data leakage, we enforce a scaffold-based split. Additionally, we provide fine-grained peak-atom level annotations to support future use. Leveraging this resource, we establish a comprehensive evaluation suite covering diverse downstream tasks, including structure elucidation, functional group prediction from NMR, toxicity prediction from NMR, and spectral simulation, and we benchmark representative state-of-the-art methods on it. Finally, we release an open-source automated leaderboard to foster community collaboration and standardize future research. The dataset, benchmark, and leaderboard are publicly available at \textcolor{blue}{https://AIMS-Lab-HKUSTGZ.github.io/NMRGym/}.
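The scaffold-based split mentioned in the abstract can be illustrated with a minimal sketch. This is not the NMRGym implementation: the function name `scaffold_split`, the 80/10/10 fractions, and the greedy largest-group-first assignment are all assumptions chosen for illustration, and the scaffold keys (for example, Bemis-Murcko scaffold SMILES) are assumed to have been computed beforehand with a cheminformatics toolkit. The point it shows is that molecules sharing a scaffold are always assigned to the same subset, so no scaffold leaks between train, validation, and test.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold appears in two subsets.

    `scaffolds` holds one precomputed scaffold key per molecule
    (hypothetical input; the fractions are illustrative defaults).
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        # A whole scaffold group is placed into exactly one subset.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


if __name__ == "__main__":
    # Toy scaffold keys standing in for real scaffold SMILES.
    keys = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
    train, valid, test = scaffold_split(keys)
    print(len(train), len(valid), len(test))
```

Because whole groups are assigned at once, the resulting subset sizes only approximate the requested fractions; that trade-off is inherent to any group-wise split of this kind.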