🤖 AI Summary
This study examines whether Earth observation (EO)-specific foundation models actually offer an advantage for remote sensing image retrieval, a question left open by the lack of systematic evaluation of their cross-scene generalization. It compares representative EO-specific models and general-purpose vision foundation models under the same datasets, retrieval protocol, and evaluation metric. The experiments show that EO-specific pretraining alone does not guarantee superior retrieval representations: general-purpose models are competitive with, and in some cases better than, EO-specific models, and they transfer more stably in cross-scene settings. These findings point to a limitation of current EO-specific models in cross-scene generalization.
📝 Abstract
Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent Earth observation (EO) vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.
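
To make the idea of a shared retrieval protocol concrete, the following is a minimal sketch of how frozen features from any model could be scored on a retrieval task. The abstract does not name the similarity measure or the evaluation metric, so cosine similarity and mean average precision (mAP) are assumptions made here for illustration, and all function names are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_precision(query, query_label, gallery, gallery_labels):
    """AP for one query: rank the gallery by similarity, then average the
    precision at each rank where a same-label image is retrieved."""
    order = sorted(range(len(gallery)),
                   key=lambda i: cosine(query, gallery[i]),
                   reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if gallery_labels[idx] == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """mAP over all queries (assumed metric, not confirmed by the source)."""
    aps = [average_precision(q, lab, gallery, gallery_labels)
           for q, lab in zip(queries, query_labels)]
    return sum(aps) / len(aps)
```

Because the scoring function only sees feature vectors and labels, the same code can be applied unchanged to features extracted by an EO-specific model or a generalist model, which is the property a controlled comparison relies on.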