🤖 AI Summary
Existing remote sensing foundation models (RSFMs) are constrained by fixed spectral band configurations and spatial resolutions, limiting generalization to practical scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales. To address this, we propose Any Optical Model (AOM), a universal optical RSFM supporting arbitrary band combinations, sensor types, and spatial resolutions. The approach introduces four key innovations: (1) a spectrum-agnostic tokenizer that assigns each channel a dedicated band embedding; (2) multi-scale adaptive patch embedding that modulates the receptive field; (3) a multi-scale semantic alignment mechanism; and (4) a channel-wise masked autoencoding pretraining strategy. Together, these enable joint spectral-spatial modeling and dynamic resolution adaptation. Evaluated across more than 10 benchmark datasets, including Sentinel-2, Landsat, and HLS, AOM achieves state-of-the-art performance on band-missing, cross-sensor, and cross-resolution transfer tasks, advancing universal representation learning for heterogeneous, multi-source remote sensing data.
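To make innovation (1) concrete, below is a minimal sketch of how a spectrum-agnostic tokenizer with per-channel band embeddings could work, assuming a PyTorch, ViT-style setup. All names here (`BandTokenizer`, `band_ids`, the dimensions) are illustrative assumptions, not the paper's actual implementation: the key idea is that each channel is patch-projected independently and stamped with a learned band-identity embedding, so any subset of bands can be encoded.

```python
# Illustrative sketch only; class and argument names are assumptions.
import torch
import torch.nn as nn

class BandTokenizer(nn.Module):
    """Tokenizes each spectral channel independently and adds a learned
    band-identity embedding, so arbitrary band subsets can be encoded."""

    def __init__(self, num_known_bands: int, patch_size: int, dim: int):
        super().__init__()
        # One shared patch projection, applied per channel (single-channel input).
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        # A dedicated embedding per band encodes its spectral identity.
        self.band_embed = nn.Embedding(num_known_bands, dim)

    def forward(self, x: torch.Tensor, band_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image with an arbitrary band subset;
        # band_ids: (C,) indices identifying which bands are present.
        B, C, H, W = x.shape
        # Project every channel separately: (B*C, 1, H, W) -> (B*C, dim, h, w).
        tokens = self.proj(x.reshape(B * C, 1, H, W))
        tokens = tokens.flatten(2).transpose(1, 2)           # (B*C, h*w, dim)
        tokens = tokens.reshape(B, C, -1, tokens.shape[-1])  # (B, C, N, dim)
        # Stamp every token of a channel with that band's identity embedding.
        tokens = tokens + self.band_embed(band_ids).view(1, C, 1, -1)
        return tokens.reshape(B, C * tokens.shape[2], -1)    # (B, C*N, dim)

# Example: tokenize a 4-band crop (e.g. Sentinel-2 B2, B3, B4, B8).
tok = BandTokenizer(num_known_bands=16, patch_size=16, dim=256)
img = torch.randn(2, 4, 64, 64)
ids = torch.tensor([1, 2, 3, 7])
print(tok(img, ids).shape)  # torch.Size([2, 64, 256])
```

Because the spectral identity lives in the embedding table rather than in a fixed input-channel layout, missing bands simply shrink the token sequence and new bands only require a new embedding row.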
📝 Abstract
Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real-world scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment. To address these limitations, we propose Any Optical Model (AOM), a universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, AOM introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. To maintain global semantic consistency across varying resolutions, AOM further incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking-and-reconstruction pretraining strategy that jointly models spectral-spatial relationships. Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that AOM consistently achieves state-of-the-art (SOTA) performance under challenging band-missing, cross-sensor, and cross-resolution settings.
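To illustrate the channel-wise masking-and-reconstruction objective described above, here is a hedged sketch in PyTorch: entire bands are hidden at input and the network is trained to reconstruct them from the visible ones. The training-step function, its call signature, and the `ToyReconstructor` stand-in are all assumptions for illustration; the abstract does not specify AOM's actual encoder-decoder.

```python
# Hedged sketch of channel-wise masked pretraining; names and the stand-in
# model are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_masked_step(model: nn.Module, x: torch.Tensor,
                        band_ids: torch.Tensor,
                        mask_ratio: float = 0.5) -> torch.Tensor:
    """One pretraining step: drop a random subset of bands, predict them back."""
    C = x.shape[1]
    num_keep = max(1, int(round(C * (1 - mask_ratio))))
    perm = torch.randperm(C)
    keep, drop = perm[:num_keep], perm[num_keep:]
    if drop.numel() == 0:  # need at least one masked band to supervise
        keep, drop = perm[:-1], perm[-1:]
    # Encode only the visible bands; the model must accept variable band sets.
    pred = model(x[:, keep], band_ids[keep], band_ids[drop])
    # Reconstruction loss is computed on the masked bands only.
    return F.mse_loss(pred, x[:, drop])

class ToyReconstructor(nn.Module):
    """Trivial stand-in: predicts each masked band from the mean of visible ones."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, visible, visible_ids, target_ids):
        B, _, H, W = visible.shape
        fused = visible.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        return self.head(fused).expand(B, target_ids.numel(), H, W)

# Usage: a 6-band image with half the bands masked each step.
loss = channel_masked_step(ToyReconstructor(), torch.randn(2, 6, 32, 32),
                           torch.arange(6))
loss.backward()
```

Masking whole channels, rather than spatial patches alone, is what forces the model to learn cross-band relationships, which is plausibly why the pretrained representation transfers to band-missing and cross-sensor settings.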