🤖 AI Summary
Remote sensing sensors such as Sentinel-2 and aerial platforms yield inconsistent feature representations, and high-resolution annotated data is scarce, leading to poor cross-resolution generalization. To address this, the paper proposes X-STARS, a cross-sensor self-supervised training and alignment framework. Its core innovation is the first multi-sensor alignment dense loss, which aligns features across resolutions and platforms via contrastive image-patch matching. X-STARS supports both pretraining from scratch and continual pretraining of existing models. Evaluated on the newly constructed Cities-France multi-sensor dataset, X-STARS consistently outperforms state-of-the-art methods across seven downstream classification and segmentation tasks. Moreover, it achieves comparable performance with 30–50% fewer annotated samples, significantly reducing annotation burden while enhancing model transferability across heterogeneous remote sensing modalities.
📝 Abstract
Large-scale “foundation models” have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation (EO) satellites, these models should learn “sensor-agnostic” representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To better leverage multi-sensor data, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the multi-sensor alignment dense loss, that aligns representations across sensors, even with vastly different resolutions, through a contrastive patch-wise mechanism. X-STARS can be applied to train models from scratch, or to adapt large models pretrained on, e.g., low-resolution EO data to new high-resolution sensors in a continual pretraining framework. We collect and release Cities-France, a new multi-sensor dataset, on which we train our X-STARS models and then evaluate them on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state of the art with less data across various conditions of data availability and resolution.
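To make the contrastive patch-wise mechanism concrete, here is a minimal NumPy sketch of a dense alignment loss between co-located patch embeddings from two sensors. The exact formulation in the paper is not reproduced here: the function name, the temperature value, the single-direction InfoNCE form, and the use of the diagonal (same spatial location) as positives are all illustrative assumptions.

```python
import numpy as np

def dense_alignment_loss(patches_a, patches_b, temperature=0.07):
    """Illustrative patch-wise contrastive (InfoNCE-style) loss.

    patches_a, patches_b: (N, D) embeddings of N spatially co-located
    patches from two different sensors. Patch i in `patches_a` and patch i
    in `patches_b` cover the same ground area, so pair (i, i) is the
    positive; all other pairs in the batch serve as negatives.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    # Cross-sensor similarity matrix, scaled by temperature: (N, N)
    logits = (a @ b.T) / temperature
    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (matched spatial locations)
    return -np.mean(np.diag(log_prob))
```

Under this sketch, perfectly aligned embeddings score a lower loss than spatially shuffled ones, which is the signal that pulls the two sensors' patch representations together during training:

```python
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
print(dense_alignment_loss(x, x))                      # low: positives matched
print(dense_alignment_loss(x, x[rng.permutation(16)])) # higher: misaligned
```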