MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems

📅 2025-02-27
🤖 AI Summary
The capabilities of large language models (LLMs) and large vision-language models (LVLMs) in multimodal scientific reasoning, particularly in mathematics and physics, remain poorly characterized. Method: We introduce MMSciBench, a dedicated multimodal benchmark for scientific reasoning, featuring both text-only and image-text questions, human-annotated difficulty levels, fine-grained subject categorization, and interpretable ground-truth answers. Our evaluation framework jointly models textual understanding, visual perception, cross-modal alignment, and cognitive difficulty to ensure fair and rigorous assessment of both LLMs and LVLMs. Contribution/Results: Experiments reveal that state-of-the-art models achieve only 63.77% overall accuracy, with a pronounced performance drop on image-based reasoning tasks, highlighting a critical bottleneck in vision-language collaborative reasoning. To foster reproducibility and community advancement, the MMSciBench dataset is publicly released on Hugging Face and the evaluation code on GitHub.

📝 Abstract
Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only 63.77% accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced at GitHub, and the dataset is available at Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

Scientific reasoning in LLMs and LVLMs is largely untested, especially in multimodal settings.
Limitations in visual-textual integration and complex reasoning lack systematic characterization.
A rigorous benchmark is needed for evaluating language models on multimodal scientific problems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MMSciBench, a benchmark for multimodal scientific reasoning in mathematics and physics.
Text-only and text-image question formats with human-annotated difficulty levels, detailed solutions, and taxonomic mappings.
Open-source evaluation code (GitHub) and dataset (Hugging Face); see the sketch below.
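Since the page links to Hugging Face and GitHub without giving exact identifiers, the following is a minimal sketch of how one might load the benchmark and score a model on it, assuming a hypothetical dataset ID and field names (question, image, answer). Exact-match accuracy is one plausible reading of the reported metric, not necessarily the authors' own scorer.

```python
# Minimal evaluation sketch for MMSciBench-style data.
# NOTE: the dataset ID and field names below are hypothetical placeholders;
# the page does not give the actual Hugging Face identifier or schema.
from datasets import load_dataset

def model_answer(question, image=None):
    # Stub: replace with a real LLM/LVLM inference call.
    # For text-image questions, `image` would also be passed to the model.
    return ""

def exact_match_accuracy(predictions, references):
    # Fraction of predictions that exactly match the ground-truth answers.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / max(len(references), 1)

dataset = load_dataset("MMSciBench/MMSciBench", split="test")  # hypothetical ID

preds, refs = [], []
for ex in dataset:
    preds.append(model_answer(ex["question"], ex.get("image")))
    refs.append(ex["answer"])

print(f"Overall accuracy: {exact_match_accuracy(preds, refs):.2%}")
```

The released evaluation code on GitHub is the authoritative scorer; this sketch only illustrates the overall-accuracy computation behind a figure like 63.77%.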
Authors
Xinwu Ye
The University of Hong Kong
Chengfan Li
Department of Computer Science, Brown University
Siming Chen
School of Data Science, Fudan University
Xiangru Tang
Department of Computer Science, Yale University
Wei Wei
Datawiz LLC