🤖 AI Summary
Existing models struggle to unify heterogeneous chemical data (molecular structures, spectroscopic profiles, textual descriptions, and reaction equations) and do not fully exploit the capabilities of large language models.
Method: We introduce the first large-scale multimodal foundation model tailored for chemistry, enabling end-to-end joint modeling of SMILES, InChI, molecular graphs, IR/NMR spectra, and natural language. The architecture combines a GNN-based molecular encoder, a CNN-based spectral branch, and multimodal adapters, tied together by a chemistry-aware cross-modal alignment mechanism and a domain-adaptive pretraining paradigm. Pretraining employs contrastive learning and masked-modality reconstruction.
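To make the two pretraining objectives concrete, here is a minimal PyTorch sketch of batch-wise contrastive alignment (InfoNCE-style) between a graph embedding and a spectral embedding, plus a masked-modality reconstruction term. The summary does not specify the model's actual components or losses, so all names and hyperparameters here (`CrossModalAligner`, `shared_dim`, `temperature`, the stand-in encoder outputs) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two pretraining objectives described above.
# Module and parameter names are hypothetical placeholders; the actual
# architecture and losses are not given in this summary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    def __init__(self, graph_dim=256, spec_dim=256, shared_dim=128, temperature=0.07):
        super().__init__()
        # Projection heads map each modality into a shared embedding space.
        self.graph_proj = nn.Linear(graph_dim, shared_dim)
        self.spec_proj = nn.Linear(spec_dim, shared_dim)
        # Decoder for masked-modality reconstruction (spectrum from graph).
        self.spec_decoder = nn.Linear(shared_dim, spec_dim)
        self.temperature = temperature

    def contrastive_loss(self, graph_emb, spec_emb):
        # InfoNCE over the batch: matching (graph, spectrum) pairs are
        # positives, all other pairings in the batch are negatives.
        g = F.normalize(self.graph_proj(graph_emb), dim=-1)
        s = F.normalize(self.spec_proj(spec_emb), dim=-1)
        logits = g @ s.t() / self.temperature
        targets = torch.arange(g.size(0), device=g.device)
        # Symmetric loss: graph->spectrum and spectrum->graph retrieval.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def reconstruction_loss(self, graph_emb, spec_target):
        # Masked-modality reconstruction: predict the spectral features
        # from the molecular-graph embedding alone.
        pred = self.spec_decoder(self.graph_proj(graph_emb))
        return F.mse_loss(pred, spec_target)

# Usage with random stand-in features for a batch of 8 molecules.
aligner = CrossModalAligner()
graph_emb = torch.randn(8, 256)  # stand-in for GNN encoder output
spec_emb = torch.randn(8, 256)   # stand-in for CNN spectral-branch output
loss = (aligner.contrastive_loss(graph_emb, spec_emb)
        + aligner.reconstruction_loss(graph_emb, spec_emb))
loss.backward()
```

In a sketch like this, the contrastive term pulls paired modalities together in the shared space (which is what would enable zero-shot cross-modal retrieval), while the reconstruction term forces each embedding to retain enough information to recover a masked modality.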
Contribution/Results: The model achieves an average 9.3% improvement over unimodal baselines across 12 downstream tasks, supports zero-shot cross-modal inference, and attains 68.7% Top-1 accuracy on the USPTO-50K retrosynthesis prediction benchmark.