FUS-MAE: A Cross-Attention-Based Data Fusion Approach for Masked Autoencoders in Remote Sensing

📅 2024-01-05
🏛️ IEEE International Geoscience and Remote Sensing Symposium
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses two challenges of cross-modal fusion between synthetic aperture radar (SAR) and multispectral optical imagery in remote sensing: a severe modality gap and high annotation costs. The authors propose Fus-MAE, a self-supervised multimodal fusion framework built on the Masked Autoencoder (MAE). Its core contribution is the integration of cross-attention into the MAE encoder, enabling early, feature-level fusion of SAR and optical data without the hand-crafted augmentation strategies that limit contrastive learning in cross-domain remote sensing pretraining. On downstream tasks, Fus-MAE matches the performance of contrastive learning approaches tailored to SAR-optical fusion and outperforms other MAE-based frameworks pretrained on larger corpora. For reproducibility, code and pretrained weights are publicly released.

📝 Abstract
Self-supervised frameworks for representation learning have recently stirred up interest among the remote sensing community, given their potential to mitigate the high labeling costs associated with curating large satellite image datasets. In the realm of multimodal data fusion, while contrastive learning methods can help bridge the domain gap between different sensor types, they rely on data augmentation techniques that require expertise and careful design, especially for multispectral remote sensing data. A possible but rather scarcely studied way to circumvent these limitations is to use a masked image modelling based pretraining strategy. In this paper, we introduce Fus-MAE, a self-supervised learning framework based on masked autoencoders that uses cross-attention to perform early and feature-level data fusion between synthetic aperture radar and multispectral optical data - two modalities with a significant domain gap. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion and outperforms other masked-autoencoder frameworks trained on a larger corpus. For replicability, code and weights are provided in this github repository.
Problem

Research questions and friction points this paper is trying to address.

Reducing labeling costs for large satellite image datasets
Bridging domain gap between SAR and optical data
Avoiding complex data augmentations in contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention based fusion for multimodal data
Masked autoencoder for self-supervised learning
Early feature-level fusion of SAR-optical data
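The cross-attention fusion the bullets above describe can be sketched as follows: tokens from one modality act as queries while the other modality supplies keys and values, so each SAR patch token aggregates information from the optical tokens. This is a minimal single-head NumPy sketch under assumed shapes and projection matrices, not the paper's actual implementation (which operates on unmasked patch tokens inside a transformer encoder).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # queries from one modality; keys/values from the other
    Q = q_tokens @ Wq
    K = kv_tokens @ Wk
    V = kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_q, n_kv) attention logits
    return softmax(scores, axis=-1) @ V       # optical-informed SAR features

rng = np.random.default_rng(0)
d = 16                                        # illustrative embedding dim
sar = rng.normal(size=(49, d))                # hypothetical SAR patch tokens
opt = rng.normal(size=(49, d))                # hypothetical optical patch tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

fused_sar = cross_attention(sar, opt, Wq, Wk, Wv)  # SAR attends to optical
print(fused_sar.shape)  # (49, 16)
```

Running this fusion early in the encoder, rather than concatenating decoder outputs, is what the summary refers to as early, feature-level fusion.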
Hugo Chan-To-Hing
National University of Singapore, Department of Electrical and Computer Engineering
Bharadwaj Veeravalli
NUS, Singapore
Parallel & Distributed Computing · Cloud Computing · Grid Computing · High-Performance Computing · Embedded Computing