🤖 AI Summary
Existing scientific evaluation benchmarks suffer from three key limitations: insufficient multilingual reasoning assessment, incomplete multimodal coverage, and coarse-grained knowledge annotation. To address these gaps, we introduce MME-SCI—the first comprehensive scientific benchmark integrating multilingual support (Chinese, English, French, Spanish, Japanese), multimodal inputs (image-only, text-only, image-text), and fine-grained disciplinary knowledge annotation (mathematics, physics, chemistry, biology), comprising 1,019 high-quality human-annotated question-answer pairs. We design three modality-specific evaluation protocols to enable systematic, attribution-aware analysis of cross-lingual reasoning and multimodal understanding capabilities. Extensive experiments across 20 mainstream models (16 open-weight and 4 closed-weight) demonstrate that MME-SCI significantly raises evaluation difficulty—e.g., o4-mini achieves below 53% accuracy on image-only tasks across all four disciplines—and effectively uncovers critical bottlenecks in cross-lingual and multimodal scientific reasoning, thereby closing the identified gaps in multilingual evaluation, multimodal coverage, and fine-grained knowledge analysis.
📝 Abstract
Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) incomplete coverage of the input modalities MLLMs support; 3) lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which span three distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is broadly challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracies of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The data and evaluation code are available at https://github.com/JCruan519/MME-SCI.
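As a quick sanity check on the reported figures, the per-subject Image-only accuracies for o4-mini can be macro-averaged (a simple unweighted mean across subjects; this is an illustrative calculation, not necessarily how the benchmark aggregates scores):

```python
# Per-subject Image-only accuracies for o4-mini, as reported in the abstract (%).
accuracies = {
    "mathematics": 52.11,
    "physics": 24.73,
    "chemistry": 36.57,
    "biology": 29.80,
}

# Unweighted macro-average across the four subjects.
macro_avg = sum(accuracies.values()) / len(accuracies)
print(f"o4-mini Image-only macro-average accuracy: {macro_avg:.2f}%")
# → o4-mini Image-only macro-average accuracy: 35.80%
```

Every per-subject score sits well below 53%, consistent with the summary's claim that the Image-only mode is substantially harder than existing benchmarks.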