Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the robustness of multimodal large language models (MLLMs) in verifying scientific claims using tabular and graphical evidence. We conduct cross-format experiments on an adapted scientific paper dataset with 12 state-of-the-art MLLMs and include human expert benchmarking for rigorous comparison. Results reveal that current MLLMs achieve strong performance on table understanding but exhibit substantial deficiencies in chart comprehension, coupled with poor cross-modal generalization; in contrast, human experts maintain high accuracy across both modalities. To our knowledge, this work provides the first quantitative characterization of structural weaknesses in MLLMs’ chart semantic parsing—highlighting the critical need to enhance chart understanding for trustworthy, AI-assisted scientific review. Our findings deliver empirical grounding for future model development, identifying concrete limitations and actionable directions for improving multimodal reasoning in scientific domains.

📝 Abstract
With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of multimodal LLMs in verifying scientific claims using tables and charts
Assessing performance gap between table-based and chart-based evidence verification
Investigating limited cross-modal generalization in smaller multimodal LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated multimodal LLMs using adapted scientific datasets
Assessed claim verification across table and chart evidence formats
Found models struggle with chart-based evidence verification
Xanh Ho
National Institute of Informatics, Japan
Yun-Ang Wu
National Taiwan University
Sunisth Kumar
The University of Tokyo, Japan
Florian Boudin
Associate Professor, LS2N - Nantes Université and JFLI - National Institute of Informatics / Tokyo
Natural Language Processing, Information Retrieval, Computational Linguistics
Atsuhiro Takasu
National Institute of Informatics, Japan
Akiko Aizawa
National Institute of Informatics, Japan; The University of Tokyo, Japan