🤖 AI Summary
The capabilities of large language models (LLMs) and large vision-language models (LVLMs) in multimodal scientific reasoning—particularly in mathematics and physics—remain poorly characterized. Method: We introduce MMSciBench, a dedicated multimodal benchmark for scientific reasoning, featuring both text-only and image-text questions, human-annotated difficulty levels, fine-grained subject categorization, and interpretable ground-truth answers. Our evaluation framework jointly considers textual understanding, visual perception, cross-modal alignment, and cognitive difficulty to ensure fair and rigorous assessment of both LLMs and LVLMs. Contribution/Results: Experiments reveal that even state-of-the-art models achieve only 63.77% overall accuracy, with a pronounced performance drop on image-based reasoning tasks—highlighting a critical bottleneck in vision-language collaborative reasoning. To foster reproducibility and community advancement, we publicly release MMSciBench on Hugging Face and the evaluation code on GitHub.
📝 Abstract
Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain largely untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only **63.77%** accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced on GitHub, and the dataset is available on Hugging Face.
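Since the dataset is distributed via Hugging Face, it can likely be loaded with the standard `datasets` library. The sketch below is a minimal example; the repository ID `MMSciBench/MMSciBench` and the `test` split name are assumptions, so substitute the identifiers from the actual release.

```python
from datasets import load_dataset

# Hypothetical repository ID and split name -- replace with the actual
# MMSciBench release on Hugging Face.
ds = load_dataset("MMSciBench/MMSciBench", split="test")

# Each example is expected to pair a question (optionally with an image)
# with a human-annotated difficulty level, subject tags, and a worked solution.
for example in ds.select(range(3)):
    print(example)
```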