🤖 AI Summary
This work addresses the weak generalization of remote sensing foundation models in cross-satellite scenarios, where spectral bands may be entirely non-overlapping or newly introduced. To tackle this, we propose GeoCrossBench, the first systematic benchmark and evaluation protocol designed specifically for cross-band generalization, filling a gap in the existing GeoBench framework, which assumes band alignment. Methodologically, we introduce ChiViT, a self-supervised extension of ChannelViT that models multispectral channels and aligns features across satellites. Experiments reveal that mainstream models suffer a 2-4x performance drop on cross-satellite tasks; in contrast, ChiViT significantly outperforms strong baselines (e.g., DINOv3) and achieves stable cross-satellite transfer with only last-layer fine-tuning, demonstrating superior generalization and deployment efficiency. This work redefines the evaluation paradigm for remote sensing foundation models, establishing a new standard and technical pathway toward robust, large-scale remote sensing models.
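The three evaluation settings described above can be sketched as simple set relations between the training satellite's bands and a test satellite's bands. This is an illustrative sketch only: the satellite names, band identifiers, and the `partial overlap` fallback are assumptions, not the actual GeoCrossBench configuration.

```python
# Hypothetical band sets; the actual GeoCrossBench satellites and bands may differ.
SAT_BANDS = {
    "train_sat": {"B2", "B3", "B4", "B8"},                    # e.g. RGB + NIR
    "disjoint_sat": {"VV", "VH"},                             # e.g. radar: no overlap
    "superset_sat": {"B2", "B3", "B4", "B8", "B11", "B12"},   # training bands + extras
}

def setting(train_bands, test_bands):
    """Categorize a test satellite relative to the training band set."""
    if test_bands == train_bands:
        return "in-distribution"
    if not (test_bands & train_bands):
        return "no band overlap"
    if test_bands > train_bands:
        return "additional bands"
    return "partial overlap"  # assumed fallback, not part of the described protocol

train = SAT_BANDS["train_sat"]
labels = {name: setting(train, bands) for name, bands in SAT_BANDS.items()}
# labels maps each satellite to one of the three evaluation settings
```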
📝 Abstract
The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of these models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands relative to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general-purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25% when given additional bands at test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can achieve relatively consistent performance across all satellites, highlighting that the benchmark is far from saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
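The last experiment above, fine-tuning only the final linear layer on top of a frozen backbone, is essentially a linear probe. The following is a minimal sketch of that idea using NumPy and closed-form ridge regression; the `frozen_backbone` function, dimensions, and random data are all stand-ins, not the paper's actual models or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen backbone: maps inputs into a shared embedding space.
# Its weights (proj) are pretrained and never updated during probing.
def frozen_backbone(x, proj):
    return np.tanh(x @ proj)

n, in_dim, emb_dim, n_classes = 200, 8, 16, 3
proj = rng.normal(size=(in_dim, emb_dim))   # frozen (pretrained) weights
X = rng.normal(size=(n, in_dim))            # stand-in for satellite imagery
y = rng.integers(0, n_classes, size=n)      # stand-in for oracle labels

Z = frozen_backbone(X, proj)                # features; no gradients needed
Y = np.eye(n_classes)[y]                    # one-hot targets

# "Fine-tuning just the last linear layer": fit only the head W on frozen
# features, here via ridge regression in closed form.
lam = 1e-2
W = np.linalg.solve(Z.T @ Z + lam * np.eye(emb_dim), Z.T @ Y)

preds = (Z @ W).argmax(axis=1)
train_acc = (preds == y).mean()
```

Because the backbone stays frozen, only `emb_dim * n_classes` parameters are fit per satellite, which is why this setting is cheap to run across many band configurations.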