🤖 AI Summary
Automated extraction of molecular structures and reaction data from scientific literature is hindered by the diversity of chemical representations, unstructured document formats, and complex page layouts. This paper introduces the first end-to-end vision-driven framework that jointly performs molecule instance detection, reaction graph topology parsing, and optical chemical structure recognition (OCSR) directly from full-page PDFs or images. The three tasks are unified within a single model architecture, accompanied by the first page-level benchmark dataset of 550 annotated pages with MOLfile ground truth and dedicated evaluation metrics. The method integrates document layout-aware detection with reaction graph structural reasoning, achieving state-of-the-art performance on this benchmark and on multiple public datasets. The code, pre-trained models, and an interactive online demo will be publicly released.
📝 Abstract
The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a test set of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark test set will be publicly available, and the MolMole toolkit will soon be accessible through an interactive demo on the LG AI Research website. For commercial inquiries, please contact us at contact_ddu@lgresearch.ai.