AI Summary
Chemical reaction diagrams in the literature are predominantly stored as raster images, rendering them inaccessible to automated parsing and modeling. To address this, we propose RxnCaption, a framework that reformulates reaction diagram recognition as a vision-language captioning task guided by visual prompts. Specifically, we introduce the "BBox and Index as Visual Prompt" (BIVP) strategy: our molecule detector, MolYOLO, pre-localizes molecular entities, and their bounding boxes and indices are drawn directly onto the input image, enabling large vision-language models (LVLMs) to generate structured reaction descriptions. This approach eliminates conventional coordinate regression, simplifying model design and improving robustness. We further construct RxnCaption-11k, the first large-scale reaction diagram captioning dataset, comprising 11,000 diverse samples spanning four canonical layout types. The proposed RxnCaption-VL model achieves state-of-the-art performance across multiple metrics, establishing a new paradigm for chemical image understanding and reaction knowledge extraction.
Abstract
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them neither machine-readable nor usable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image, turning downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a test subset balanced across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
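The BIVP pre-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes molecule detections arrive as `(x0, y0, x1, y1)` pixel boxes (e.g. from a detector such as MolYOLO, whose exact output format is not specified here) and uses Pillow to draw the boxes and their indices onto the diagram before the annotated image is passed to an LVLM.

```python
from PIL import Image, ImageDraw


def draw_bivp_prompts(image, boxes):
    """Draw bounding boxes and 1-based indices onto a reaction diagram.

    `boxes` is a list of (x0, y0, x1, y1) tuples, assumed to come from a
    molecule detector. Returns a new annotated image; the input image is
    left unmodified so the raw diagram can still be used elsewhere.
    """
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Box around the detected molecule region.
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        # Index label inside the top-left corner, so the LVLM can refer
        # to molecules by number in its structured caption.
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return out


if __name__ == "__main__":
    diagram = Image.new("RGB", (200, 120), "white")
    annotated = draw_bivp_prompts(diagram, [(10, 10, 80, 60), (110, 30, 180, 90)])
    annotated.save("bivp_annotated.png")
```

With the indices burned into the pixels, the downstream prompt can simply ask the LVLM to describe the reaction in terms of "molecule 1", "molecule 2", and so on, avoiding any coordinate regression in the language model itself.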