Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether existing electro-optical (EO)-specific foundation models actually offer an advantage for remote sensing image retrieval, a question left open in part because their cross-scene generalization has not been evaluated systematically. It conducts a controlled comparison between representative EO-specific models and general-purpose vision foundation models using the same datasets, retrieval protocol, and evaluation metric. The experiments indicate that EO-specific pretraining alone does not guarantee superior retrieval representations: general-purpose models are competitive with EO-specific models, outperform them in some cases, and transfer more stably in cross-scene settings, where EO-specific models often degrade substantially. These findings point to a limitation of current EO-specific models in cross-scene generalization.

📝 Abstract

Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical (EO) vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.
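
To make "retrieval-based evaluation" concrete, the following is a minimal sketch of how frozen embeddings from any backbone could be scored under one shared protocol. The abstract does not name its metric or similarity function, so mean average precision (mAP) and cosine similarity are assumptions here, and `mean_average_precision` is a hypothetical helper, not the authors' code.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def mean_average_precision(query_emb, query_labels, gallery_emb, gallery_labels):
    """Rank the gallery for each query by cosine similarity and return mAP.

    Assumption: a gallery item is relevant when its class label matches
    the query's label. Queries with no relevant gallery item are skipped.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_emb, dtype=np.float64))
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)

    sims = q @ g.T  # shape: (num_queries, num_gallery)
    order = np.argsort(-sims, axis=1, kind="stable")  # best match first

    average_precisions = []
    for i in range(q.shape[0]):
        relevant = gallery_labels[order[i]] == query_labels[i]
        if not relevant.any():
            continue
        hits = np.cumsum(relevant)
        ranks = np.arange(1, relevant.size + 1)
        # Precision measured at each rank where a relevant item appears.
        average_precisions.append((hits[relevant] / ranks[relevant]).mean())

    return float(np.mean(average_precisions)) if average_precisions else 0.0
```

Under this sketch, an in-domain score would draw the query and gallery embeddings from the same dataset, while a cross-scene score would draw them from different ones. The function is called the same way for an EO-specific and a generalist backbone, so only the embeddings differ between runs.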

Problem

Research questions and friction points this paper is trying to address.

remote sensing

vision foundation models

image retrieval

electro-optical

cross-scene generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision foundation models

remote sensing retrieval

electro-optical imagery