🤖 AI Summary
Existing evaluations of geospatial foundation models (GFMs) suffer from geographic bias (e.g., overrepresentation of North America and Europe), inconsistent evaluation protocols, and narrow task coverage, hindering rigorous assessment of their real-world applicability. To address this, the authors introduce PANGAEA, a standardized and geographically diverse benchmark for GFMs spanning multiple image resolutions, sensor modalities, and temporalities across a varied set of downstream datasets and tasks. Evaluating the most popular openly available GFMs, they find that performance varies significantly across domains and that GFMs do not consistently outperform supervised baselines (e.g., UNet and vanilla ViT), including when labeled data is limited. The benchmark is designed to be extensible to new datasets, models, and tasks, and the evaluation code is publicly released to foster reproducible, fair, and comparable GFM evaluation.
📝 Abstract
Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, limiting the usefulness of these evaluations for assessing the real-world applicability of GFMs. Additionally, current evaluation protocols lack diversity: they fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, calling into question the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities, establishing a robust and widely applicable benchmark for GFMs. We evaluate the most popular openly available GFMs on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g., UNet and vanilla ViT) and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at https://github.com/VMarsocci/pangaea-bench.