🤖 AI Summary
Earth observation (EO) is hampered by substantial heterogeneity across sensor modalities (e.g., optical, SAR, hyperspectral) and by the poor generalization of existing foundation models. To address this, we propose DOFA (“Dynamic One-For-All”), a novel multimodal foundation model that pioneers the integration of neural plasticity principles into remote sensing modeling. DOFA employs a Transformer-based dynamic hypernetwork architecture, augmented with multimodal feature alignment and wavelength-aware adapters, which lets a single model adapt in real time to unseen sensors and spectral band configurations. Trained via joint self-supervised pretraining across five sensor modalities, DOFA achieves state-of-the-art performance on 12 diverse EO downstream tasks. It substantially outperforms unimodal baselines, with cross-sensor transfer gains of up to +27.3%, demonstrating strong generalization and potential for practical deployment.
📝 Abstract
The development of foundation models has revolutionized our ability to interpret the Earth's surface using satellite observational data. Traditional models have been siloed, tailored to specific sensors or data types such as optical, radar, and hyperspectral, each with its own unique characteristics. This specialization hinders holistic analysis that could benefit from the combined strengths of these diverse data sources. Our novel approach introduces the Dynamic One-For-All (DOFA) model, which draws on the concept of neural plasticity from brain science to adaptively integrate various data modalities into a single framework. This dynamic hypernetwork, which adjusts to different wavelengths, enables a single versatile Transformer jointly trained on data from five sensors to excel across 12 distinct Earth observation tasks, including tasks that involve sensors never seen during pretraining. DOFA's design offers a promising step towards more accurate, efficient, and unified Earth observation analysis, showing remarkable adaptability and performance on multimodal Earth observation data.
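To make the idea of a wavelength-conditioned dynamic hypernetwork more concrete, here is a minimal sketch in plain Python. It is not the DOFA implementation: the class `WavelengthHyperNet`, the function `wavelength_embedding`, the layer sizes, and the sinusoidal encoding are all assumptions chosen for illustration. The point it shows is that, when projection weights are generated from each band's central wavelength rather than stored per channel, one set of parameters can embed inputs with any number of bands.

```python
import math
import random


def wavelength_embedding(wavelength_um, dim=8):
    """Sinusoidal encoding of a band's central wavelength (micrometres).

    Hypothetical choice for this sketch; the source does not specify one.
    """
    feats = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        feats.append(math.sin(wavelength_um * freq))
        feats.append(math.cos(wavelength_um * freq))
    return feats


class WavelengthHyperNet:
    """Toy hypernetwork: wavelength embedding -> per-band projection weights."""

    def __init__(self, wave_dim=8, embed_dim=4, seed=0):
        rng = random.Random(seed)
        self.wave_dim = wave_dim
        self.embed_dim = embed_dim
        # A single linear layer shared by all bands (embed_dim x wave_dim).
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(wave_dim)]
                  for _ in range(embed_dim)]
        self.b = [0.0] * embed_dim

    def band_weights(self, wavelength_um):
        """Generate the projection vector for one band from its wavelength."""
        e = wavelength_embedding(wavelength_um, self.wave_dim)
        return [sum(w * x for w, x in zip(row, e)) + bias
                for row, bias in zip(self.W, self.b)]

    def embed_pixel(self, values, wavelengths_um):
        """Project a pixel with any number of bands to a fixed-size vector."""
        if len(values) != len(wavelengths_um):
            raise ValueError("one wavelength is required per band value")
        out = [0.0] * self.embed_dim
        for value, wl in zip(values, wavelengths_um):
            weights = self.band_weights(wl)
            for k in range(self.embed_dim):
                out[k] += value * weights[k]
        return out


net = WavelengthHyperNet()
# Illustrative wavelengths roughly matching Sentinel-2 red, green, blue, NIR.
rgb = net.embed_pixel([0.2, 0.4, 0.1], [0.665, 0.560, 0.490])
rgbn = net.embed_pixel([0.2, 0.4, 0.1, 0.7], [0.665, 0.560, 0.490, 0.842])
print(len(rgb), len(rgbn))  # both embeddings have embed_dim entries
```

In this sketch a three-band and a four-band pixel map to vectors of the same size without any change to the parameters, which is the property that would let a shared Transformer backbone consume data from different sensors. A real model would generate convolutional patch-embedding kernels rather than per-pixel vectors and would learn the hypernetwork jointly with the backbone.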