AI Summary
Chemical reaction diagrams in the literature are predominantly stored as raster images, rendering them inaccessible to automated parsing and modeling. To address this, we propose RxnCaption, a framework that reformulates reaction diagram recognition as a vision-language captioning task guided by visual prompts. Specifically, we introduce the "BBox and Index as Visual Prompt" (BIVP) strategy: our molecule detector, MolYOLO, pre-localizes molecular entities, and their bounding boxes and indices are drawn directly onto the input image, enabling large vision-language models (LVLMs) to generate structured reaction descriptions. This approach eliminates conventional coordinate regression, simplifying model design and improving robustness. We further construct RxnCaption-11k, the first large-scale reaction diagram captioning dataset, comprising 11,000 diverse samples spanning four canonical layout types. The proposed RxnCaption-VL model achieves state-of-the-art performance across multiple metrics, establishing a new paradigm for chemical image understanding and reaction knowledge extraction.
Abstract
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them neither machine-readable nor usable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image, turning downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a test subset balanced across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
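The BIVP pre-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes molecule detections arrive as `(x0, y0, x1, y1)` pixel boxes (e.g. from a detector such as MolYOLO, whose exact output format is not specified here) and uses Pillow to draw the boxes and their indices onto the diagram before the annotated image is passed to an LVLM.

```python
from PIL import Image, ImageDraw


def draw_bivp_prompts(image, boxes):
    """Draw bounding boxes and 1-based indices onto a reaction diagram.

    `boxes` is a list of (x0, y0, x1, y1) tuples, assumed to come from a
    molecule detector. Returns a new annotated image; the input image is
    left unmodified so the raw diagram can still be used elsewhere.
    """
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Box around the detected molecule region.
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        # Index label inside the top-left corner, so the LVLM can refer
        # to molecules by number in its structured caption.
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return out


if __name__ == "__main__":
    diagram = Image.new("RGB", (200, 120), "white")
    annotated = draw_bivp_prompts(diagram, [(10, 10, 80, 60), (110, 30, 180, 90)])
    annotated.save("bivp_annotated.png")
```

With the indices burned into the pixels, the downstream prompt can simply ask the LVLM to describe the reaction in terms of "molecule 1", "molecule 2", and so on, avoiding any coordinate regression in the language model itself.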