MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

📅 2024-11-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Chemical literature and patents encode a large share of their key information as molecular structure diagrams, including Markush structures. Existing optical chemical structure recognition (OCSR) methods perform poorly on real-world documents because of low image quality, heterogeneous drawing styles, noise, and the complexity of Markush representations. The paper proposes MolParser, an end-to-end OCSR method designed for in-the-wild data. Its contributions are an extended SMILES encoding rule; MolParser-7M, a large-scale annotated molecular image dataset built under that rule; and an end-to-end molecular image captioning model trained with synthetic data, real document samples selected by active learning, and curriculum learning. According to the authors, the method significantly outperforms classical and learning-based OCSR approaches across most scenarios. The MolParser-7M dataset is publicly available.

📝 Abstract
In recent decades, the number of chemistry publications and patents has increased rapidly. A significant portion of key information is embedded in molecular structure figures, which complicates large-scale literature searches and limits the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is therefore of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. In addition to a large amount of synthetic data, we use active learning to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We train an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Automatically extract precise chemical structures from documents.
Overcome challenges in recognizing Markush structures and image variations.
Improve optical chemical structure recognition (OCSR) for real-world applications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end molecular image captioning model
Extended SMILES encoding for dataset annotation
Active learning with real-world document samples
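
The abstract does not say how real-world samples are chosen for annotation, so the following is only a minimal sketch of one common active learning criterion, least-confidence sampling, and not the authors' procedure. The `Prediction` record, the `select_for_annotation` helper, the image ids, and the confidence values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model prediction for an unlabeled image crop (hypothetical record)."""
    image_id: str
    smiles: str        # predicted structure string
    confidence: float  # sequence-level confidence in [0, 1] (assumed available)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` least-confident predictions.

    Least-confidence sampling assumes that the samples the model is least
    sure about are the most informative ones to label by hand.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort ascending by confidence; ties are broken by image_id for determinism.
    ranked = sorted(predictions, key=lambda p: (p.confidence, p.image_id))
    return [p.image_id for p in ranked[:budget]]


# Made-up pool of predictions on crops from patents and papers.
# "*" is the SMILES wildcard atom, used here as a stand-in for a variable group.
pool = [
    Prediction("patent_0001", "CCO", 0.98),
    Prediction("patent_0002", "c1ccccc1*", 0.41),
    Prediction("paper_0003", "CC(=O)O", 0.87),
    Prediction("patent_0004", "*c1ccc(*)cc1", 0.35),
]

to_label = select_for_annotation(pool, budget=2)
print(to_label)  # ['patent_0004', 'patent_0002']
```

In a loop of this kind, the selected crops would be annotated, added to the training set, and the model retrained before the next selection round. Other criteria, such as disagreement between several models, would slot into the same structure by changing only the sort key.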