🤖 AI Summary
This work addresses two key challenges in chemical reaction diagram parsing: the misalignment between visual chemical entities and pretrained knowledge, and the mismatch between token-level training objectives and reaction-level evaluation metrics. To overcome these issues, the authors propose IdtVP, a method that leverages molecular identifiers as visual prompts to activate chemical priors embedded in vision-language models. They further introduce Re3-DAPO, a reinforcement learning algorithm guided by verifiable rewards that directly optimizes reaction-level performance. The study also presents ScannedRxn, the first benchmark comprising real-world scanned reaction diagrams. Experimental results demonstrate that the proposed approach significantly outperforms existing methods on both standard and scanned reaction datasets, exhibiting strong zero-shot capability, robustness, and out-of-distribution generalization.
📝 Abstract
Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from the literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm for automating this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompt representation and learning paradigm. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecular identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution generalization. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
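The abstract's core idea of "verifiable rewards that directly optimize reaction-level metrics" can be illustrated with a minimal sketch. The actual Re3-DAPO reward design is not specified here, so the set-based component matching, the exact-match bonus, and the function names below are illustrative assumptions, not the paper's implementation; a real system would canonicalize molecule strings with a chemistry toolkit such as RDKit rather than simple whitespace normalization.

```python
def normalize(entity: str) -> str:
    """Normalize a predicted entity string for comparison.

    Illustrative only: a real pipeline would canonicalize SMILES with a
    chemistry toolkit so that equivalent structures compare equal.
    """
    return " ".join(entity.split())


def reaction_f1(pred: list[str], gold: list[str]) -> float:
    """Set-level F1 over reaction components (reactants, conditions, products)."""
    p = {normalize(x) for x in pred}
    g = {normalize(x) for x in gold}
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


def reaction_reward(pred: list[str], gold: list[str]) -> float:
    """Verifiable, reaction-level reward for RL fine-tuning.

    Full credit (1.0) only when the predicted reaction matches the gold
    reaction exactly; otherwise soft credit via component-level F1, so the
    reward aligns with reaction-level evaluation rather than token overlap.
    """
    if {normalize(x) for x in pred} == {normalize(x) for x in gold}:
        return 1.0
    return reaction_f1(pred, gold)
```

Because the reward is computed directly from the parsed reaction against ground truth, it is verifiable (no learned reward model) and can be plugged into a policy-gradient objective over sampled model outputs.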