MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing molecular foundation models rely predominantly on SMILES representations and neglect experimentally derived spectroscopic data (NMR, IR, MS) and 3D structural information, which limits their performance in stereochemical analysis, conformational prediction, and experimental validation. To address this, we propose MolSpectLLM, the first multimodal foundation model integrating experimental NMR, IR, and MS spectra with molecular 3D conformations. Built on Qwen2.5-7B, it employs multi-task learning to unify SMILES, spectral, and spatial representations. Crucially, it enables end-to-end generation from spectra to SMILES to 3D conformations, bridging spectral interpretation, structural elucidation, and de novo design. On spectrum-related tasks, it achieves a mean accuracy of 0.53 across NMR, IR, and MS benchmarks; on Spectra-to-SMILES generation, it reaches 15.5% sequence accuracy and 41.7% token accuracy; and its 3D structure generation significantly outperforms general-purpose LLMs, enhancing practical utility in drug discovery and related domains.

📝 Abstract

Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information, two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design.
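The abstract reports two Spectra-to-SMILES metrics, sequence accuracy and token accuracy, without defining them here. The sketch below shows one plausible reading, not the paper's evaluation code: sequence accuracy as the fraction of predictions that exactly match the reference SMILES string, and token accuracy as the fraction of reference tokens reproduced at the same position. The tokenizer regex, the function names, and the position-wise definition are all assumptions made for illustration.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical SMILES tokenizer (an assumption, not the paper's):
# bracket atoms, two-letter halogens, %nn ring closures, then any single character.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens with the regex above."""
    return _SMILES_TOKEN.findall(smiles)


def spectra_to_smiles_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> Tuple[float, float]:
    """Return (sequence_accuracy, token_accuracy) under the assumed definitions.

    sequence_accuracy: share of predictions identical to their reference string.
    token_accuracy: share of reference tokens matched at the same position,
    pooled over all examples (missing predicted positions count as errors).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")

    exact_matches = 0
    matched_tokens = 0
    total_tokens = 0
    for pred, ref in zip(predictions, references):
        exact_matches += int(pred == ref)
        pred_tokens, ref_tokens = tokenize(pred), tokenize(ref)
        total_tokens += len(ref_tokens)
        # zip stops at the shorter sequence, so unmatched tail tokens score zero.
        matched_tokens += sum(p == r for p, r in zip(pred_tokens, ref_tokens))

    sequence_accuracy = exact_matches / len(references)
    token_accuracy = matched_tokens / total_tokens if total_tokens else 0.0
    return sequence_accuracy, token_accuracy


if __name__ == "__main__":
    preds = ["CCO", "CCN", "c1ccccc1Cl"]
    refs = ["CCO", "CCO", "c1ccccc1Br"]
    seq_acc, tok_acc = spectra_to_smiles_accuracy(preds, refs)
    print(f"sequence accuracy: {seq_acc:.3f}, token accuracy: {tok_acc:.3f}")
```

String-level exact match is strict: two different SMILES strings can encode the same molecule, so an evaluation that canonicalizes both sides first would likely report higher sequence accuracy. That gap is one reason token accuracy (41.7%) can sit well above sequence accuracy (15.5%), since a single wrong token anywhere fails the whole sequence.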

Problem

Research questions and friction points this paper is trying to address.

Integrating spectroscopy data with molecular structure modeling

Overcoming limitations of SMILES-only molecular representations

Generating accurate 3D structures from spectral or SMILES inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates experimental spectroscopy with 3D molecular structure

Generates accurate 3D structures from SMILES or spectral inputs

Built on the Qwen2.5-7B foundation model

🔎 Similar Papers

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization