🤖 AI Summary
Existing chemical multimodal benchmarks suffer from shallow semantics and modality scarcity, failing to rigorously evaluate models' cross-modal joint reasoning over complex chemical entities such as organic molecules, inorganic materials, and 3D crystals. To address this, we introduce ChemVTS-Bench, a chemistry-specific tri-modal evaluation benchmark that presents each task as visual-only molecular images, image-text pairs, and SMILES strings, accompanied by a standardized, automated assessment framework enabling fine-grained error diagnosis and reasoning-behavior analysis. The benchmark unifies visual understanding, SMILES structural parsing, and language-based reasoning into an end-to-end evaluation pipeline. Experiments reveal that state-of-the-art multimodal large language models (MLLMs) exhibit significant deficiencies on vision-driven structural chemistry tasks; multimodal fusion partially alleviates these errors but does not fundamentally resolve them. ChemVTS-Bench establishes a reproducible, decomposable, and actionable evaluation paradigm for chemical AI, advancing rigorous, interpretable, and domain-grounded model assessment.
📝 Abstract
Chemical reasoning inherently integrates visual, textual, and symbolic modalities, yet existing benchmarks rarely capture this complexity, often relying on simple image-text pairs with limited chemical semantics. As a result, the actual ability of Multimodal Large Language Models (MLLMs) to process and integrate chemically meaningful information across modalities remains unclear. We introduce ChemVTS-Bench, a domain-authentic benchmark designed to systematically evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of MLLMs. ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures, with each task presented in three complementary input modes: (1) visual-only, (2) visual-text hybrid, and (3) SMILES-based symbolic input. This design enables fine-grained analysis of modality-dependent reasoning behaviors and cross-modal integration. To ensure rigorous and reproducible evaluation, we further develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes. Extensive experiments on state-of-the-art MLLMs reveal that visual-only inputs remain challenging, structural chemistry is the hardest domain, and multimodal fusion mitigates but does not eliminate visual, knowledge-based, or logical errors. These findings establish ChemVTS-Bench as a rigorous, domain-faithful testbed for advancing multimodal chemical reasoning. All data and code will be released to support future research.