MolMole: Molecule Mining from Scientific Literature

📅 2025-04-30
🤖 AI Summary
Automated extraction of molecular structures and reaction data from scientific literature is hindered by diverse chemical representations, unstructured documents, and complex page layouts. This paper introduces MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) in a single pipeline operating directly on page-level documents. The authors also present a page-level benchmark test set of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, together with a novel evaluation metric. MolMole is reported to outperform existing toolkits on both this benchmark and public datasets. The benchmark test set will be made publicly available, and the toolkit will be accessible through an interactive demo on the LG AI Research website.

📝 Abstract
The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a testset of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark testset will be publicly available, and the MolMole toolkit will be accessible soon through an interactive demo on the LG AI Research website. For commercial inquiries, please contact us at contact_ddu@lgresearch.ai.
Problem

Research questions and friction points this paper is trying to address.

Extracting molecular structures from unstructured scientific documents
Parsing reaction diagrams and chemical data automatically
Lacking standard benchmarks for page-level chemical data extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-based deep learning for molecule detection
Unified pipeline for chemical data extraction
Novel benchmark and evaluation metric
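
The benchmark annotates each page with molecule bounding boxes, so detection quality can be scored by matching predicted boxes to annotated ones. The sketch below uses plain intersection-over-union (IoU) with greedy one-to-one matching, which is a common generic approach and is not the paper's novel metric (the summary does not define that metric). The function names, the box layout `(x_min, y_min, x_max, y_max)`, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in page pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    preds: List[Box], truths: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box.

    Returns (precision, recall). The threshold is an illustrative default,
    not a value taken from the paper.
    """
    unmatched = list(range(len(truths)))
    true_positives = 0
    for pred in preds:
        best_index: Optional[int] = None
        best_score = threshold
        for j in unmatched:
            score = iou(pred, truths[j])
            if score >= best_score:
                best_index, best_score = j, score
        if best_index is not None:
            unmatched.remove(best_index)  # each annotation is matched at most once
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall
```

For example, with two annotated molecules and two predictions of which only one overlaps an annotation, `match_detections` returns a precision and recall of 0.5 each. A page-level evaluation would likely also need to compare the recognized structure against the annotated MOLfile, which this sketch does not attempt.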
Authors

LG AI Research
Sehyun Chun
Jiye Kim
Ahra Jo
Yeonsik Jo
Seungyul Oh
Seungjun Lee
Kwang-seok Ryoo
Jongmin Lee
Seunghwan Kim (Seoul National University)
Byung Jun Kang
Soonyoung Lee (LG AI Research; Computer Vision, Machine Learning)
Jun Ha Park
Chanwoo Moon
Jiwon Ham
Haein Lee
Heejae Han
Jaeseung Byun
Soojong Do
Minju Ha
Dongyun Kim
Kyunghoon Bae (LG AI Research; Generative AI, Computer Vision, Natural Language Processing, Continual Learning, Explainable AI)
Woohyung Lim (LG AI Research; Deep Learning, Representation Learning, Anomaly Detection, Time-series Forecasting)
Edward Hwayoung Lee
Yongmin Park
Jeongsang Yu
Gerrard Jeongwon Jo
Yeonjung Hong
Kyungjae Yoo
Sehui Han
Jaewan Lee
Changyoung Park
Kijeong Jeon
Sihyuk Yi