GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

📅 2025-11-19
🤖 AI Summary
Current geospatial foundation models (GeoFMs) lack standardized evaluation protocols across tasks and data characteristics, hindering fair performance comparison and application-specific model selection. To address this, we propose a comprehensive benchmarking framework for GeoFMs, introducing a "capability grouping" paradigm that systematically characterizes model capabilities along key data dimensions: spatial resolution, spectral band count, and temporal granularity. The framework covers five core remote sensing tasks (classification, semantic segmentation, instance segmentation, object detection, and regression) across 19 permissively licensed Earth observation (EO) datasets, and defines a standardized, modular, multimodality-aware evaluation protocol. Code, benchmark data, and a dynamic leaderboard are publicly released. Experimental results reveal no universally dominant model: models pre-trained on natural images excel on high-resolution tasks, whereas domain-specific GeoFMs lead on multispectral applications. This work advances standardization and reproducibility in EO AI research.
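To make the capability-grouping idea concrete, here is a minimal sketch in Python. All dataset names, metadata fields (group, bands, gsd_m), and scores are invented for illustration; only the model names echo the paper, and the actual GEO-Bench-2 toolkit may structure this differently.

```python
# Minimal sketch of capability grouping: datasets carry characteristic tags,
# and models are ranked by their average score within each capability group.
# All metadata and scores below are fabricated for illustration only.
from collections import defaultdict
from statistics import mean

# Hypothetical dataset metadata: capability group, band count, resolution (m).
DATASETS = {
    "crop-type":    {"group": "multispectral",   "bands": 13, "gsd_m": 10.0},
    "flood-extent": {"group": "multispectral",   "bands": 12, "gsd_m": 10.0},
    "building-seg": {"group": "high-resolution", "bands": 3,  "gsd_m": 0.3},
}

# Hypothetical per-dataset scores (e.g., mIoU) for two models named in the paper.
SCORES = {
    "terramind":         {"crop-type": 0.71, "flood-extent": 0.68, "building-seg": 0.74},
    "convnext-imagenet": {"crop-type": 0.63, "flood-extent": 0.61, "building-seg": 0.81},
}

def rank_by_capability(datasets, scores):
    """Average each model's score over the datasets in a capability group,
    then rank models within that group from best to worst."""
    groups = defaultdict(list)
    for name, meta in datasets.items():
        groups[meta["group"]].append(name)
    rankings = {}
    for group, members in groups.items():
        per_model = {m: mean(s[d] for d in members) for m, s in scores.items()}
        rankings[group] = sorted(per_model.items(), key=lambda kv: kv[1], reverse=True)
    return rankings

for group, ranked in rank_by_capability(DATASETS, SCORES).items():
    print(f"{group}: {ranked}")
```

Averaging within a group rather than across the full suite is what lets the leaderboard answer "which model is strongest for multispectral data?" rather than only "which model is best overall?".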

📝 Abstract
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
Problem

Research questions and friction points this paper is trying to address.

Standardizing evaluation protocols for Geospatial Foundation Models, which currently lack consistent benchmarks
Developing capability-based ranking system to identify model strengths across diverse geospatial tasks
Enabling reproducible model comparison while supporting methodological innovation in geospatial AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized framework for geospatial AI evaluation
Capability groups rank models by shared characteristics
Flexible protocol supports fair comparison and adaptation research (see the sketch below)
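A rough sketch of how such a prescriptive-yet-flexible protocol could be encoded follows. Field names and default values are assumptions for illustration, not GEO-Bench-2's actual configuration schema: the prescriptive fields are pinned so results stay comparable, while the adaptation strategy remains the swappable research variable.

```python
# Hypothetical protocol object: everything except `adaptation` is fixed,
# so two runs differ only in the model-adaptation strategy under study.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalProtocol:
    # Prescriptive part: identical for every model being compared.
    dataset: str
    split_seed: int = 0
    metric: str = "mIoU"
    budget_epochs: int = 50
    # Flexible part: the open research choice the benchmark wants to study.
    adaptation: str = "frozen-linear-probe"  # e.g. "full-finetune", "lora"

runs = [
    EvalProtocol(dataset="flood-extent"),
    EvalProtocol(dataset="flood-extent", adaptation="full-finetune"),
]
for run in runs:
    print(run)
```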