Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM

📅 2026-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines whether existing electro-optical (EO)-specific foundation models offer a clear advantage for remote sensing image retrieval, with particular attention to cross-scene generalization, which has lacked systematic evaluation. It compares representative EO-specific models against generalist vision foundation models under the same datasets, retrieval protocol, and evaluation metric. The experiments show that EO-specific pretraining alone does not guarantee stronger retrieval representations: strong generalist models are competitive with EO-specific models, outperform them in some cases, and transfer more stably in cross-scene settings, where EO-specific models often degrade substantially. These findings point to limited out-of-domain generalization in current EO-specific models.
📝 Abstract
Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.
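To make the retrieval-based evaluation concrete, the following is a minimal sketch of how frozen embeddings from any backbone could be scored. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not name its metric, so cosine-similarity ranking and mean average precision (mAP) are assumed here, and the function names `l2_normalize` and `mean_average_precision` are hypothetical. In-domain evaluation corresponds to querying a dataset against itself with the query image excluded from the gallery; cross-scene evaluation corresponds to passing queries and gallery from different scenes.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def mean_average_precision(query_emb, query_labels, gallery_emb, gallery_labels,
                           exclude_self=False):
    """Assumed metric: mAP over cosine-similarity rankings.

    exclude_self=True is meant for the in-domain case where the query set
    and the gallery are the same array, so query i must not retrieve item i.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_emb, dtype=np.float64))
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)

    sims = q @ g.T                                   # (num_queries, num_gallery)
    order = np.argsort(-sims, axis=1, kind="stable")  # most similar first

    average_precisions = []
    for i in range(q.shape[0]):
        ranked = order[i]
        if exclude_self:
            ranked = ranked[ranked != i]
        relevant = (gallery_labels[ranked] == query_labels[i]).astype(np.float64)
        num_relevant = relevant.sum()
        if num_relevant == 0:
            continue  # a query with no relevant gallery item has no defined AP
        precision_at_k = np.cumsum(relevant) / np.arange(1, relevant.size + 1)
        average_precisions.append((precision_at_k * relevant).sum() / num_relevant)

    return float(np.mean(average_precisions)) if average_precisions else 0.0
```

Under this sketch, a controlled comparison would extract embeddings from each model on the same images and call the function with identical labels and settings, so that any difference in score reflects the representation rather than the protocol.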
Problem

Research questions and friction points this paper is trying to address.

remote sensing
vision foundation models
image retrieval
electro-optical
cross-scene generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision foundation models
remote sensing retrieval
electro-optical imagery
cross-scene generalization
controlled comparison