🤖 AI Summary
Existing visual reasoning benchmarks are often confined to narrow, task-specific settings and therefore fail to adequately assess AI systems’ capabilities on open-ended geoscientific problems. This work proposes GeoR-Bench, the first multidimensional benchmark for geoscientific visual reasoning, comprising 440 curated samples across six major categories and 24 task types that evaluate models’ integrated understanding of geoscience imagery and scientific diagrams through reasoning-based visual editing. The benchmark scores outputs along three dimensions (reasoning, consistency, and quality), combined with structured representation analysis and strict accuracy metrics. Evaluations of 21 state-of-the-art multimodal models show that even the best-performing model achieves only 42.7% overall strict accuracy, while the best open-source model reaches only 10.3%. This points to a significant gap: current models can produce visually plausible outputs, but they lack a deep understanding of the underlying geoscientific processes.
📝 Abstract
Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present \textbf{GeoR-Bench}, a \underline{Bench}mark for evaluating \underline{Geo}science visual \underline{R}easoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results for 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7\% overall strict accuracy, while the best open-source model reaches only 10.3\%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. These findings indicate that current models generate superficially plausible results but fail to capture the underlying earth science processes.