MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal benchmarks for Earth science predominantly rely on synthetic data or simplistic image–text pairs, limiting their utility for evaluating deep reasoning and domain-specific scientific insight. To address this, we introduce MSEarth—the first graduate-level multimodal scientific understanding benchmark for Earth science. It spans the five major geospheres (atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere) and comprises over 7,000 authentic research figures paired with expert-curated contextual reasoning annotations that integrate figure captions and paper discussion sections. Our key innovation lies in the systematic integration of open-science literature figures with rich, reasoning-oriented textual annotations, enabling high-level evaluation tasks including open-ended question answering, multiple-choice reasoning, and scientific figure description. MSEarth is compatible with mainstream multimodal large language model (MLLM) evaluation frameworks and is publicly released on Hugging Face and GitHub, providing a high-fidelity, scalable foundation for assessing and advancing MLLM reasoning capabilities in Earth science.

📝 Abstract
The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application to Earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 7K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple-choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field. Resources related to this benchmark can be found at https://huggingface.co/MSEarth and https://github.com/xiangyu-mm/MSEarth.
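As a rough illustration of how the released benchmark might be accessed, the sketch below loads it with the Hugging Face `datasets` library. The repository id and split name are assumptions, since the abstract only points to the MSEarth organization page.

```python
# Minimal sketch, assuming MSEarth is published as a Hugging Face dataset.
# The repository id "MSEarth/MSEarth" and the "test" split are placeholders;
# see https://huggingface.co/MSEarth for the actual dataset name and splits.
from datasets import load_dataset

dataset = load_dataset("MSEarth/MSEarth", split="test")

# Each record should pair a research figure with a refined caption and a
# task-specific prompt (figure captioning, multiple choice, or open-ended QA).
sample = dataset[0]
print(sample.keys())
```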
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks capturing the complexity of geoscientific reasoning
Need to evaluate multimodal comprehension of Earth science
Absence of graduate-level resources for assessing MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graduate-level multimodal benchmark for Earth science
Curated from high-quality, open-access scientific publications
Supports figure captioning, multiple-choice questions, and open-ended reasoning
Authors

Xiangyu Zhao
Shanghai Artificial Intelligence Laboratory, The Hong Kong Polytechnic University
Wanghan Xu
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Bo Liu
The Hong Kong Polytechnic University
Yuhao Zhou
Shanghai Artificial Intelligence Laboratory
Fenghua Ling
Shanghai Artificial Intelligence Laboratory
AI4Climate, Climate prediction, Weather prediction
Ben Fei
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Xiaoyu Yue
The University of Sydney
Computer Vision
Lei Bai
Shanghai AI Laboratory
Foundation Model, Science Intelligence, Multi-Agent System, Autonomous Discovery
Wenlong Zhang
Shanghai Artificial Intelligence Laboratory
Xiao-Ming Wu
The Hong Kong Polytechnic University