🤖 AI Summary
This study addresses the overlooked influence of the geographic composition of pretraining data on geospatial foundation models. The authors systematically construct pretraining datasets at global and continental scales, hold the model architecture fixed, apply multiscale sampling, and evaluate on downstream tasks. They find a strong positive correlation between spectral diversity and model performance, whereas geographic, biome, and land cover diversity show weaker associations. Notably, models pretrained on European data consistently outperform the others on both global and local tasks. The work releases seven new pretraining datasets, the corresponding pretrained models, and an evaluation framework, and proposes spectral diversity as a design principle for building high-performing geospatial pretraining datasets.
📝 Abstract
New geospatial foundation models each introduce a new model architecture and a new pretraining dataset, with the dataset often sampled using a different notion of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study of how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated models pretrained on them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed the global and the other continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets in terms of their diversity across continents, biomes, land cover, and spectral values. We found that only spectral diversity was strongly correlated with performance, while the other diversity measures were only weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.