MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks

๐Ÿ“… 2025-05-20
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the weak transferability of pre-trained models caused by structural heterogeneity among multi-modal Earth observation (EO) data, such as spectral, elevation, and segmentation maps, this work pioneers the adaptation of the MultiMAE framework to the EO domain. We propose a multi-modal, multi-task masked autoencoding pre-training method capable of processing arbitrary subsets of modalities. By enforcing cross-modal feature alignment and joint reconstruction, our approach moves beyond modality-specific pre-training paradigms and enables a single unified model to flexibly accommodate heterogeneous inputs. Evaluated on multiple EO benchmarks, our method surpasses state-of-the-art approaches on both classification and segmentation tasks. Under end-to-end fine-tuning, it delivers consistent transfer-performance gains of 12.6%–18.3%, significantly enhancing generalization capability and deployment flexibility.

๐Ÿ“ Abstract
Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
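
To make the pre-training recipe concrete, below is a minimal PyTorch sketch of a MultiMAE-style model for EO data. This is not the authors' implementation: the modality names, channel counts, token-zeroing mask (the actual MAE formulation drops masked tokens from the encoder), and all hyper-parameters are illustrative assumptions. It shows the core idea described in the abstract: per-modality patch embeddings, a single shared Transformer encoder over whatever modalities are present, and lightweight per-modality decoders that reconstruct patches for the multi-task objective.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Project one modality (B, C, H, W) into a sequence of patch tokens."""
    def __init__(self, in_chans, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


class MultiMAESketch(nn.Module):
    """Shared encoder + per-modality decoders over an arbitrary subset of inputs."""
    def __init__(self, modal_chans=None, dim=256, depth=4, heads=8, patch=16):
        super().__init__()
        # Assumed EO modalities and channel counts (e.g. 12 spectral bands).
        modal_chans = modal_chans or {"spectral": 12, "elevation": 1, "segmentation": 1}
        self.embeds = nn.ModuleDict(
            {m: PatchEmbed(c, dim, patch) for m, c in modal_chans.items()}
        )
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # shared across all modalities
        # One lightweight decoder per modality predicts the pixels of each patch.
        self.decoders = nn.ModuleDict(
            {m: nn.Linear(dim, c * patch * patch) for m, c in modal_chans.items()}
        )

    def forward(self, inputs, mask_ratio=0.75):
        """`inputs` is a dict holding any subset of the pre-training modalities."""
        tokens, keep_masks, visible = {}, {}, []
        for m, x in inputs.items():
            t = self.embeds[m](x)                                  # (B, N, dim)
            keep = torch.rand(t.shape[:2], device=t.device) > mask_ratio
            keep_masks[m] = keep                                   # True = visible patch
            tokens[m] = t
            # Simplification: masked tokens are zeroed rather than dropped.
            visible.append(t * keep.unsqueeze(-1))
        # Jointly encode visible tokens from all available modalities.
        encoded = self.encoder(torch.cat(visible, dim=1))
        # Split the sequence back per modality and reconstruct every patch;
        # the reconstruction loss would be applied only on masked patches.
        recon, start = {}, 0
        for m, t in tokens.items():
            n = t.shape[1]
            recon[m] = self.decoders[m](encoded[:, start:start + n])
            start += n
        return recon, keep_masks


if __name__ == "__main__":
    model = MultiMAESketch()
    batch = {  # any subset of modalities may be supplied
        "spectral": torch.randn(2, 12, 224, 224),
        "elevation": torch.randn(2, 1, 224, 224),
    }
    recon, masks = model(batch)
    print({m: r.shape for m, r in recon.items()})  # (2, 196, C*16*16) per modality
```

In this sketch the multi-task objective would simply be a per-modality mean-squared error (or cross-entropy for the segmentation map) computed on the masked patches, summed across whichever modalities are present in the batch.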
Problem

Research questions and friction points this paper is trying to address.

Enhancing transfer learning with multi-modal EO data
Overcoming weak transfer when downstream data structures differ from those used during pre-training
Flexible multi-task pre-training for diverse EO input modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal multi-task masked autoencoder for EO
Pre-training with diverse input modalities reconstruction
Flexible handling of varied input configurations (illustrated in the sketch after this list)
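
Below is a hypothetical fine-tuning usage that continues the sketch after the abstract: the decoders are discarded, a task head is attached to the shared encoder, and the downstream batch supplies only the modality it happens to have (here spectral imagery alone). The names, shapes, and the 10-class head are illustrative assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning with a single available modality; the pre-trained
# encoder is reused as-is and the MultiMAE decoders are no longer needed.
model = MultiMAESketch()                          # in practice, load pre-trained weights here
x = torch.randn(2, 12, 224, 224)                  # a spectral-only downstream batch
tokens = model.embeds["spectral"](x)              # patch tokens for the one modality present
features = model.encoder(tokens).mean(dim=1)      # (B, dim) pooled scene representation
head = nn.Linear(256, 10)                         # e.g. a 10-class scene-classification head
logits = head(features)
```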
๐Ÿ”Ž Similar Papers
No similar papers found.