GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Optical Chemical Structure Recognition (OCSR) faces challenges including molecular image complexity, diverse functional-group abbreviations, and inconsistent annotations, limiting the generalization of existing vision-language models (VLMs). To address these, we propose the first “graph traversal as visual chain-of-thought” mechanism for OCSR, mimicking human atom-by-atom and bond-by-bond structural parsing. We further introduce the “see-as-recognize” data principle, explicitly aligning image-based abbreviations with canonical chemical semantics. Built upon a VLM architecture, our approach integrates instruction tuning (GTR-CoT-1.3M), graph-structured decoding, and the fine-grained benchmark MolRec-Bench. On molecule images containing abbreviations, our method achieves ~14 percentage-point improvements over the best prior baseline in both SMILES generation and molecular graph reconstruction—outperforming specialized models, chemistry-domain VLMs, and commercial general-purpose VLMs.

📝 Abstract
Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the "Graph Traversal as Visual Chain of Thought" mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of "Faithfully Recognize What You've Seen", which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
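The core idea of "graph traversal as visual chain of thought" is to linearize a molecular graph into an ordered sequence of atom and bond predictions, rather than emitting a SMILES string in one shot. The paper does not publish its exact serialization format, so the sketch below is only an illustrative, hypothetical approximation: a breadth-first traversal that turns a toy molecular graph into atom/bond tokens, with all function and token names (`traverse_molecule`, `atom:i:symbol`, `bond:i-j:order`) invented for this example.

```python
from collections import deque

def traverse_molecule(atoms, bonds, start=0):
    """Serialize a molecular graph into an atom/bond token sequence via BFS.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: dict mapping (i, j) with i < j to a bond-order string, e.g. {(0, 1): "-"}
    Returns tokens such as "atom:0:C" and "bond:0-1:-" (a made-up format).
    """
    # Build an undirected adjacency list from the bond dictionary.
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        adj[i].append((j, order))
        adj[j].append((i, order))

    tokens, visited, seen_bonds = [], {start}, set()
    queue = deque([start])
    while queue:
        i = queue.popleft()
        tokens.append(f"atom:{i}:{atoms[i]}")
        for j, order in sorted(adj[i]):
            key = (min(i, j), max(i, j))
            # Emit each bond once, the first time either endpoint is expanded.
            if key not in seen_bonds:
                seen_bonds.add(key)
                tokens.append(f"bond:{key[0]}-{key[1]}:{order}")
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return tokens

# Toy example: acetaldehyde-like fragment C-C=O
print(traverse_molecule(["C", "C", "O"], {(0, 1): "-", (1, 2): "="}))
# → ['atom:0:C', 'bond:0-1:-', 'atom:1:C', 'bond:1-2:=', 'atom:2:O']
```

In GTR-Mol-VLM such a sequence would be the supervision target for instruction tuning, so the model's decoding order mirrors a human reading the structure atom by atom and bond by bond.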
Problem

Research questions and friction points this paper is trying to address.

Recognizing complex molecular structures from images accurately
Addressing inconsistent annotations in molecular structure digitization
Improving Optical Chemical Structure Recognition (OCSR) for real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Traversal as Visual Chain of Thought
Faithfully Recognize What You've Seen
Large-scale instruction-tuning dataset GTR-CoT-1.3M