🤖 AI Summary
Existing benchmarks for vision-language models struggle to evaluate complex, traceable multimodal scientific reasoning. To address this gap, this work introduces SciVQR, a multimodal scientific reasoning benchmark spanning 54 subfields across six disciplines (mathematics, physics, chemistry, geography, astronomy, and biology) and incorporating domain-specific visual elements such as equations, charts, and diagrams. The benchmark requires models to integrate visual understanding with multi-step reasoning to answer questions. Because 46% of its tasks come with expert-authored solutions, it supports joint evaluation of the reasoning process and the final answer. Evaluations of leading open-source and proprietary models reveal significant shortcomings on complex multimodal scientific tasks and point to two priorities for future model development: stronger multi-step reasoning and better integration of interdisciplinary knowledge.
📝 Abstract
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of the reasoning processes that rigorous evaluation requires. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inference, and 46% of them include expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insight into how models reach their conclusions. Our evaluation of leading MLLMs, both proprietary and open-source, reveals significant limitations in handling complex multimodal reasoning tasks. These results underscore the need for improved multi-step reasoning and better integration of interdisciplinary knowledge to advance MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.