AI Summary
Existing geospatial benchmark datasets suffer from limited modality diversity and insufficient global coverage, hindering the evaluation of multimodal models' generalization under geographic distribution shifts. To address this gap, this work introduces MMEarth-Bench, the first large-scale, multimodal, and globally comprehensive benchmark, encompassing five environmental tasks and twelve distinct data modalities. Furthermore, the paper proposes TTT-MMR, a model-agnostic test-time training method that leverages all available modalities during inference to perform multimodal reconstruction, thereby overcoming the input modality constraints inherent in conventional approaches. Experimental results demonstrate that TTT-MMR substantially improves model performance on both randomly partitioned and geographically out-of-distribution test sets, effectively enhancing cross-regional adaptability.
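To make the difference between a randomly partitioned test set and a geographically out-of-distribution one concrete, the following is a minimal sketch. It assumes a simple latitude/longitude grid-cell holdout; the summary does not say how MMEarth-Bench builds its splits, so `grid_cell`, `geographic_split`, the `cell_deg` parameter, and the sample layout are all hypothetical.

```python
import random


def random_split(samples, test_fraction, seed=0):
    """Shuffle all samples and cut: test points may sit next to train points."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def grid_cell(lat, lon, cell_deg):
    """Map a coordinate to a coarse grid cell (hypothetical helper)."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def geographic_split(samples, test_fraction, cell_deg=10.0, seed=0):
    """Hold out whole grid cells so test regions are never seen in training.

    Assumption: grid-cell holdout is only one possible way to build a
    geographic split; it is not taken from the paper.
    """
    cells = {}
    for s in samples:
        cells.setdefault(grid_cell(s["lat"], s["lon"], cell_deg), []).append(s)
    keys = sorted(cells)
    random.Random(seed).shuffle(keys)
    n_test_cells = max(1, int(round(len(keys) * test_fraction)))
    test_keys = set(keys[:n_test_cells])
    train = [s for k in keys if k not in test_keys for s in cells[k]]
    test = [s for k in keys if k in test_keys for s in cells[k]]
    return train, test


# Synthetic coordinates, for illustration only.
_rng = random.Random(1)
samples = [
    {"lat": _rng.uniform(-60.0, 60.0), "lon": _rng.uniform(-180.0, 180.0)}
    for _ in range(200)
]
rand_train, rand_test = random_split(samples, test_fraction=0.2)
geo_train, geo_test = geographic_split(samples, test_fraction=0.2)
```

Under the random split, a test sample can share a grid cell with training samples, so the test set measures in-distribution performance. Under the grid-cell holdout, every test sample comes from a region absent from training, which is the kind of shift a geographic test split is meant to probe.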
Abstract
Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. To facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.
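The general idea of test-time training with auxiliary reconstruction can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear encoder, one frozen linear reconstruction head per auxiliary modality, scalar modality targets, and plain gradient descent, and every name in it (`encode`, `reconstruct`, `ttt_step`, `predict`) is hypothetical. The point it shows is that the auxiliary modalities serve only as reconstruction targets at test time, so the model never needs to accept them as input.

```python
def encode(W, x):
    """Linear encoder: z = W x."""
    return [sum(w * x_i for w, x_i in zip(row, x)) for row in W]


def reconstruct(heads, z):
    """One frozen linear head per auxiliary modality: r_m = v_m . z."""
    return [sum(v_j * z_j for v_j, z_j in zip(v, z)) for v in heads]


def recon_loss(W, heads, batch):
    """Mean squared reconstruction error over samples and modalities."""
    total, count = 0.0, 0
    for x, targets in batch:
        for r, a in zip(reconstruct(heads, encode(W, x)), targets):
            total += (r - a) ** 2
            count += 1
    return total / count


def ttt_step(W, heads, batch, lr):
    """One gradient step on the reconstruction loss; only the encoder changes."""
    n_terms = len(batch) * len(heads)
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, targets in batch:
        preds = reconstruct(heads, encode(W, x))
        for v, r, a in zip(heads, preds, targets):
            err = 2.0 * (r - a) / n_terms
            for j, v_j in enumerate(v):
                for i, x_i in enumerate(x):
                    grad[j][i] += err * v_j * x_i
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad)]


def predict(W, task_head, x):
    """Downstream prediction from the (possibly adapted) encoder."""
    return sum(h_j * z_j for h_j, z_j in zip(task_head, encode(W, x)))


# Toy test batch: (model input, auxiliary modality targets). No task labels
# are used at test time; only the auxiliary modalities supervise the update.
W0 = [[0.5, 0.0], [0.0, 0.5]]
heads = [[1.0, 0.0], [0.0, 1.0]]   # frozen reconstruction heads (assumed)
task_head = [1.0, -1.0]            # frozen task head (assumed)
batch = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 1.0]),
]

loss_before = recon_loss(W0, heads, batch)
W = W0
for _ in range(20):
    W = ttt_step(W, heads, batch, lr=0.5)
loss_after = recon_loss(W, heads, batch)
predictions = [predict(W, task_head, x) for x, _ in batch]
```

In this sketch the batch passed to `ttt_step` determines what the encoder adapts to. Grouping test samples from nearby locations into one batch, as in the geographic batching the abstract mentions, would specialize the update to that region, whereas a mixed batch would act more like a regularizer.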