RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) lack systematic evaluation of their ability to comprehend the dense graphical language of reactions—such as reaction mechanisms and molecular structures—in chemical literature. Method: We introduce RxnBench, the first multimodal benchmark tailored to chemical literature, featuring two tasks—single-figure question answering (SF-QA) and full-document question answering (FD-QA)—and a hierarchical evaluation framework that assesses cross-modal figure-text-table integration, reaction-logic reasoning, and precise structural perception. Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve below 50% accuracy on FD-QA; inference-time techniques such as chain-of-thought prompting significantly improve performance but do not close the gap. Our analysis uncovers critical limitations of general-purpose vision encoders and underscores the need for domain-specific visual representations and chemistry-aware reasoning modules. RxnBench establishes a rigorous standard for evaluating chemical AI, offering both a demanding benchmark and foundational insights into multimodal scientific understanding.

📝 Abstract
The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' chemical reaction understanding from scientific literature
Assesses multimodal integration of text, schemes, and tables in documents
Identifies gaps in deep chemical logic and structural recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

RxnBench, a multimodal benchmark for chemical reaction understanding from literature PDFs
Two-tiered tasks (SF-QA and FD-QA) testing fine-grained visual perception and cross-modal integration
Hierarchical evaluation framework revealing the need for domain-specific encoders and reasoning engines
Hanzheng Li
Shanghai Jiao Tong University
Xi Fang
DP Technology
Yixuan Li
Tsinghua University
Chaozheng Huang
DP Technology
Junjie Wang
DP Technology
Xi Wang
New York University
Hongzhe Bai
Fudan University
Bojun Hao
Xiamen University
Shenyu Lin
Shanghai Jiao Tong University
Huiqi Liang
ShanghaiTech University
Linfeng Zhang
DP Technology; AI for Science Institute
AI for Science · multi-scale modeling · molecular simulation · drug/materials design
Guolin Ke
DP Technology
Machine Learning · AI for Science