🤖 AI Summary
Current Earth science multimodal benchmarks suffer from narrow coverage, typically confined to a single geosphere, and sparse evaluation dimensions (fewer than 16 tasks), limiting assessment of models' holistic understanding of the Earth system and inter-sphere coupling. To address this, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth spheres (atmospheric, lithospheric, oceanic, cryospheric, biospheric, and anthropogenic) and their interactions. Built upon satellite and in-situ observational data, it comprises 100 expert-defined tasks across four capability tiers: perception, general reasoning, scientific knowledge reasoning, and chain-of-thought reasoning. Key methodological contributions include inter-sphere coupling modeling, a four-tier reasoning evaluation framework, and a hybrid expert-crowdsourcing annotation paradigm. Evaluating nine state-of-the-art multimodal large language models reveals a maximum accuracy of only 34.7%, and models consistently fail on cross-sphere tasks (e.g., GPT-4o scores 0% on several). All data, code, and evaluation protocols are open-sourced to advance standardized AI for Earth system science.
📝 Abstract
Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only the human-activities sphere or the atmosphere) with limited evaluation dimensions (fewer than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human-activities sphere) as well as cross-sphere interactions, with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning, and chain-of-thought (CoT) reasoning. Its construction involved 2-5 experts per sphere, who established authoritative evaluation dimensions and curated the relevant observational datasets, and 40 crowd-sourced annotators who assisted the experts with annotation; the resulting labels were then validated through hybrid expert-crowd workflows to reduce ambiguity. Experiments on nine state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmark: none reaches 35% accuracy. In particular, on some cross-sphere tasks, the performance of leading models such as GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models have been released.
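For concreteness, the sketch below shows one way the reported per-tier and per-sphere accuracies could be computed from exact-match results. It is a minimal illustration, not the benchmark's released evaluation code: the record fields (`sphere`, `tier`, `answer`, `prediction`) and the sample records are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical annotation records in an OmniEarth-Bench-like schema:
# each item carries its sphere, its reasoning tier, the ground-truth
# answer, and a model's prediction. Field names are placeholders.
records = [
    {"sphere": "atmosphere",   "tier": "perception",    "answer": "B", "prediction": "B"},
    {"sphere": "cryosphere",   "tier": "sci_knowledge", "answer": "D", "prediction": "B"},
    {"sphere": "cross-sphere", "tier": "cot_reasoning", "answer": "C", "prediction": "A"},
]

def accuracy_by(records, key):
    """Group records by the given key and return exact-match accuracy per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["prediction"] == r["answer"])
    return {group: hits[group] / totals[group] for group in totals}

print(accuracy_by(records, "tier"))    # breakdown over the four capability tiers
print(accuracy_by(records, "sphere"))  # breakdown per sphere, incl. cross-sphere
```

Grouping by a single key keeps the same helper reusable for both the four-tier breakdown and the per-sphere breakdown (including the cross-sphere tasks on which leading models score 0.0%).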