ChemVTS-Bench: Evaluating Visual-Textual-Symbolic Reasoning of Multimodal Large Language Models in Chemistry

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chemical multimodal benchmarks suffer from shallow semantics and modality scarcity, failing to rigorously evaluate models' cross-modal joint reasoning over complex chemical entities such as organic molecules, inorganic materials, and 3D crystals. To address this, we introduce ChemVTS-Bench, the first chemistry-specific tri-modal evaluation benchmark integrating molecular images, image-text pairs, and SMILES strings, accompanied by a standardized, automated assessment framework that enables fine-grained error diagnosis and reasoning-behavior analysis. The benchmark unifies visual understanding, SMILES structural parsing, and language-based reasoning into an end-to-end evaluation pipeline. Experiments reveal that state-of-the-art multimodal large language models (MLLMs) exhibit significant deficiencies on vision-driven structural chemistry tasks; multimodal fusion partially alleviates these deficiencies but does not fundamentally resolve core semantic errors. ChemVTS-Bench establishes a reproducible, decomposable, and actionable evaluation paradigm for chemical AI, advancing rigorous, interpretable, and domain-grounded model assessment.

📝 Abstract
Chemical reasoning inherently integrates visual, textual, and symbolic modalities, yet existing benchmarks rarely capture this complexity, often relying on simple image-text pairs with limited chemical semantics. As a result, the actual ability of Multimodal Large Language Models (MLLMs) to process and integrate chemically meaningful information across modalities remains unclear. We introduce ChemVTS-Bench, a domain-authentic benchmark designed to systematically evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of MLLMs. ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures, with each task presented in three complementary input modes: (1) visual-only, (2) visual-text hybrid, and (3) SMILES-based symbolic input. This design enables fine-grained analysis of modality-dependent reasoning behaviors and cross-modal integration. To ensure rigorous and reproducible evaluation, we further develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes. Extensive experiments on state-of-the-art MLLMs reveal that visual-only inputs remain challenging, structural chemistry is the hardest domain, and multimodal fusion mitigates but does not eliminate visual, knowledge-based, or logical errors, highlighting ChemVTS-Bench as a rigorous, domain-faithful testbed for advancing multimodal chemical reasoning. All data and code will be released to support future research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal reasoning in chemical contexts
Assessing integration of visual-textual-symbolic chemical information
Benchmarking MLLMs on domain-authentic chemistry problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates multimodal chemical reasoning abilities
Uses three input modes: visual, hybrid, and symbolic
Automated agent-based workflow standardizes evaluation process
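The answer-verification step of such an automated workflow can be approximated by a normalize-and-compare pass over model outputs. The sketch below, including the toy heavy-atom SMILES counter, is a hypothetical illustration under stated simplifications, not the paper's actual agent pipeline.

```python
import re
from collections import Counter

def heavy_atoms(smiles: str) -> Counter:
    # Count heavy atoms in a *simple* SMILES string.
    # Illustrative only: ignores bracket atoms, aromatic lowercase
    # tokens, ring-closure digits, and implicit hydrogens.
    return Counter(re.findall(r"Cl|Br|[BCNOSPFI]", smiles))

def verify(predicted: str, gold: str) -> bool:
    # Normalize whitespace and case before exact-match comparison,
    # mirroring a standardized automated answer-checking step.
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(predicted) == norm(gold)

print(heavy_atoms("CCO"))             # ethanol: Counter({'C': 2, 'O': 1})
print(verify(" Ethanol ", "ethanol"))  # True
```

In a full workflow, a failed `verify` call would trigger the diagnosis stage, classifying the error as visual (misread structure), knowledge-based (wrong chemical fact), or logical (broken reasoning chain), as in the error taxonomy the abstract mentions.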
Authors
Zhiyuan Huang (Renmin University of China)
Baichuan Yang (Beijing University of Posts and Telecommunications)
Zikun He (Renmin University of China)
Yanhong Wu (Meta)
Fang Hongyu (Gaotu Techedu Inc)
Zhenhe Liu (Renmin University of China)
Lin Dongsheng (Gaotu Techedu Inc)
Bing Su (Renmin University of China)