ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing

📅 2023-11-20
🏛️ Int. J. Document Anal. Recognit.
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the low accuracy of existing methods for parsing chemical structure diagrams in PDFs, which rely heavily on OCR and computer vision. We propose the first molecular graph parsing method that directly interprets native PDF vector instructions (path construction, Bézier curve definitions, and text placement), bypassing rasterization and OCR entirely. The approach extracts geometric primitives, performs geometric reasoning over them, enforces chemical topological constraints (e.g., valency, bond angle consistency), and reconstructs structures via an SVG intermediate representation. Evaluated on USPTO and PubMed PDF test sets, the method achieves 92.4% atomic connectivity accuracy, outperforming state-of-the-art CV-based methods by 17.6%. It demonstrates strong robustness to low-resolution scans, hand-drawn styles, and legacy documents. The core contribution is the first semantic interpretation of PDF vector instructions for chemical diagram parsing, which removes the dependence on rendering fidelity and file format specifics.
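To make "interpreting PDF vector instructions" concrete, here is a minimal sketch of reading straight line segments (candidate bond lines) from a PDF content stream fragment. It is an illustration only, not the paper's implementation: the function name `extract_line_segments` and the sample stream are hypothetical, and the sketch assumes an already-decompressed, whitespace-separated stream that uses only the standard `m` (move to), `l` (line to), and `h` (close path) operators. Curves, transforms, and text operators are ignored.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def extract_line_segments(content: str) -> List[Segment]:
    """Collect straight segments from 'm', 'l', and 'h' path operators.

    Assumption: `content` is a decoded content stream whose tokens are
    separated by whitespace. Any other operator is skipped.
    """
    segments: List[Segment] = []
    operands: List[float] = []
    start: Optional[Point] = None    # first point of the current subpath
    current: Optional[Point] = None  # current drawing position

    for token in content.split():
        try:
            # Numbers are operands that precede their operator.
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator

        if token == "m" and len(operands) >= 2:
            start = current = (operands[-2], operands[-1])
        elif token == "l" and len(operands) >= 2 and current is not None:
            nxt = (operands[-2], operands[-1])
            segments.append((current, nxt))
            current = nxt
        elif token == "h" and current is not None and start is not None:
            # Closing a subpath adds a segment back to its start point.
            if current != start:
                segments.append((current, start))
            current = start
        operands.clear()  # operands are consumed by every operator

    return segments


# Hypothetical fragment: a triangle drawn with move, two lines, and a close.
SAMPLE = "0 0 m 10 0 l 10 10 l h S"
print(extract_line_segments(SAMPLE))
```

A real parser would also need to track the graphics state (transformation matrices, line widths) and handle the `c` curve and text-showing operators before any geometric reasoning about bonds and atom labels could begin.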
Problem

Research questions and friction points this paper is trying to address.

Parsing molecular diagrams from PDF graphics
Training neural network for molecule recognition
Evaluating parsers with SMILES and benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

PDF primitives for parsing
Multi-task neural network
Direct molecular graph comparison
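The idea of comparing molecular graphs directly, rather than only comparing SMILES strings, can be sketched as follows. This is a simplified illustration under stated assumptions and not the paper's evaluation procedure: it assumes predicted and reference atoms already share identifiers (a real comparison would first have to match nodes, which is a graph matching problem), and the names `compare_graphs`, `make_bonds`, and `f1` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

BondInput = Tuple[str, str, int]       # (atom id, atom id, bond order)
Bond = Tuple[FrozenSet[str], int]      # undirected bond with its order


def make_bonds(bonds: Iterable[BondInput]) -> Set[Bond]:
    """Normalize bonds so that (a, b) and (b, a) compare as equal."""
    return {(frozenset((a, b)), order) for a, b, order in bonds}


def f1(predicted: set, reference: set) -> float:
    """F1 score between two sets; two empty sets count as a perfect match."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def compare_graphs(
    pred_atoms: Dict[str, str],
    pred_bonds: Iterable[BondInput],
    ref_atoms: Dict[str, str],
    ref_bonds: Iterable[BondInput],
) -> Dict[str, object]:
    """Report atom-level and bond-level agreement plus exact graph match.

    Assumption: atom ids are already aligned between the two graphs.
    """
    pred_bond_set = make_bonds(pred_bonds)
    ref_bond_set = make_bonds(ref_bonds)
    return {
        "atom_f1": f1(set(pred_atoms.items()), set(ref_atoms.items())),
        "bond_f1": f1(pred_bond_set, ref_bond_set),
        "exact": pred_atoms == ref_atoms and pred_bond_set == ref_bond_set,
    }


# Hypothetical example: reference is C-C-O, prediction mislabels the last atom.
reference_atoms = {"a1": "C", "a2": "C", "a3": "O"}
predicted_atoms = {"a1": "C", "a2": "C", "a3": "N"}
bonds = [("a1", "a2", 1), ("a2", "a3", 1)]
print(compare_graphs(predicted_atoms, bonds, reference_atoms, bonds))
```

Unlike a pass/fail check on a whole SMILES string, a graph-level comparison of this kind shows where a parse went wrong: in the example the bonds all match while one atom label does not.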
A. Shah
Document and Pattern Recognition Lab, Rochester Institute of Technology, NY, USA.
Bryan Amador
Document and Pattern Recognition Lab, Rochester Institute of Technology, NY, USA.
Abhisek Dey
Document and Pattern Recognition Lab, Rochester Institute of Technology, NY, USA.
Ming Creekmore
Document and Pattern Recognition Lab, Rochester Institute of Technology, NY, USA.
Blake Ocampo
Department of Chemistry, University of Illinois at Urbana-Champaign, IL, USA.
Scott Denmark
Department of Chemistry, University of Illinois at Urbana-Champaign, IL, USA.
R. Zanibbi
Document and Pattern Recognition Lab, Rochester Institute of Technology, NY, USA.