🤖 AI Summary
The capabilities of large language models (LLMs) and large vision-language models (LVLMs) in multimodal scientific reasoning—particularly in mathematics and physics—remain poorly characterized. Method: We introduce MMSciBench, a dedicated multimodal benchmark for scientific reasoning, featuring both text-only and image-text questions, human-annotated difficulty levels, fine-grained subject categorization, and interpretable ground-truth answers. Our evaluation framework jointly considers textual understanding, visual perception, cross-modal alignment, and cognitive difficulty to ensure fair and rigorous assessment of both LLMs and LVLMs. Contribution/Results: Experiments reveal that even state-of-the-art models achieve only 63.77% overall accuracy, with a pronounced performance drop on image-based reasoning tasks—highlighting a critical bottleneck in vision-language collaborative reasoning. To foster reproducibility and community advancement, we publicly release MMSciBench on Hugging Face and the evaluation code on GitHub.
📝 Abstract
Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain largely untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only **63.77%** accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced on GitHub, and the dataset is available on Hugging Face.
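Since the dataset is distributed via Hugging Face, it can likely be loaded with the standard `datasets` library. The sketch below is a minimal example; the repository ID `MMSciBench/MMSciBench` and the `test` split name are assumptions, so substitute the identifiers from the actual release.

```python
from datasets import load_dataset

# Hypothetical repository ID and split name -- replace with the actual
# MMSciBench release on Hugging Face.
ds = load_dataset("MMSciBench/MMSciBench", split="test")

# Each example is expected to pair a question (optionally with an image)
# with a human-annotated difficulty level, subject tags, and a worked solution.
for example in ds.select(range(3)):
    print(example)
```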