Do Satellite Tasks Need Special Pretraining?

📅 2025-10-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates whether domain-specific pre-trained vision models are necessary for remote sensing—particularly under low-data and low-resolution conditions. Method: Leveraging the MillionAID dataset, the authors train a ViT-B encoder using the iBOT self-supervised framework, incorporating remote-sensing–specific architectural modifications and training strategies. They further construct a lightweight, multi-task remote sensing benchmark to systematically evaluate generalization across diverse downstream tasks. Contribution/Results: Empirical evaluation reveals that current remote-sensing–specific pre-training methods do not consistently outperform general-purpose foundation models (e.g., DINOv2, MAE) on mainstream downstream tasks—challenging the widely held assumption that domain specialization inherently yields superior performance. To our knowledge, this is the first study to empirically assess the necessity of remote-sensing–specific foundation models under a unified experimental protocol. The findings provide critical guidance for model selection in resource-constrained remote sensing applications.

šŸ“ Abstract
Foundation models have advanced machine learning across various modalities, including images. Recently, multiple teams have trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, its specific applications, and the types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that domain-specific foundation models are more useful than general-purpose vision foundation models, at least at small scale. First, we design a simple benchmark that measures how well remote sensing models generalize to lower-resolution images on two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several remote-sensing-specific modifications. We show that none of these pretrained models brings consistent improvements over general-purpose baselines at the ViT-B scale.
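The resolution benchmark described in the abstract degrades satellite images to a lower effective resolution before evaluation. A minimal sketch of one common way to do this (the paper's exact degradation pipeline is not specified here, so the block-average-then-upsample scheme below is an assumption):

```python
import numpy as np

def degrade_resolution(img: np.ndarray, factor: int) -> np.ndarray:
    """Simulate a coarser ground-sample distance: block-average
    downsample by `factor`, then nearest-neighbor upsample back to
    the original size so the encoder input shape is unchanged."""
    h, w = img.shape[:2]
    assert h % factor == 0 and w % factor == 0
    # Block-average downsample (crude low-pass filter + decimation).
    low = img.reshape(h // factor, factor, w // factor, factor, -1).mean(axis=(1, 3))
    # Nearest-neighbor upsample back to (h, w).
    return np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)

# Example: a 224x224 RGB tile degraded to an effective 56x56 resolution.
tile = np.random.rand(224, 224, 3).astype(np.float32)
degraded = degrade_resolution(tile, factor=4)
print(degraded.shape)  # (224, 224, 3)
```

Keeping the output shape fixed means the same frozen encoder can be evaluated on the original and degraded images without any architectural change, isolating the effect of resolution alone.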
Problem

Research questions and friction points this paper is trying to address.

Evaluating specialized pretraining for satellite imagery analysis
Comparing remote sensing foundation models with general vision models
Testing model generalization on lower resolution satellite images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised vision encoder trained on satellite dataset
Modified pretraining approach for remote sensing imagery
Benchmark measuring generalization to lower resolution images
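Comparing pretrained encoders on such a benchmark typically means extracting frozen features and fitting a cheap probe on top. A minimal sketch using a nearest-class-centroid probe on precomputed features (the paper's actual protocol, e.g. linear probing, may differ; the feature arrays here are synthetic stand-ins):

```python
import numpy as np

def centroid_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Evaluate frozen encoder features with a nearest-class-centroid
    probe: a cheap stand-in for the linear probing commonly used to
    compare pretrained vision encoders on downstream tasks."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    # Assign each test feature to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(test_feats[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == test_labels).mean())

# Toy example: two well-separated synthetic "feature" clusters.
rng = np.random.default_rng(0)
f0 = rng.normal(0.0, 0.1, size=(50, 8))
f1 = rng.normal(1.0, 0.1, size=(50, 8))
feats = np.concatenate([f0, f1])
labels = np.array([0] * 50 + [1] * 50)
acc = centroid_probe_accuracy(feats, labels, feats, labels)
print(acc)  # 1.0 on this clearly separable toy data
```

Running the same probe on features from a general-purpose encoder (e.g. DINOv2) and a remote-sensing-specific one, at full and degraded resolution, gives the kind of head-to-head comparison the paper reports.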