MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of labeled data and the underutilization of multimodal (RGB, multispectral, DSM) information in remote sensing, this paper proposes a self-supervised pretraining framework for high-resolution multimodal remote sensing imagery. The method introduces an information-aware adaptive masking scheme and a cross-modal masking mechanism to jointly model inter-modal correlations and intra-modal structural characteristics, and further incorporates multi-task self-supervised objectives and a cross-modal knowledge distillation strategy. Evaluated on 26 downstream tasks across 15 datasets, the framework outperforms existing pretraining approaches on most tasks: semantic segmentation reaches mIoU of 78.30% on Potsdam and 76.50% on Vaihingen using only 50% of the labeled data; depth estimation on US3D attains an RMSE of 0.182; and binary change detection on the SECOND dataset reaches 47.51% mIoU, a 3-percentage-point improvement over the runner-up. The approach substantially improves few-shot remote sensing interpretation.
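
The information-aware adaptive masking scheme is only described at a high level in this summary. Below is a minimal PyTorch sketch of one plausible reading, assuming per-patch variance as the information score and softmax-weighted sampling that masks informative patches more often; the function name `adaptive_mask`, the variance score, and the sampling rule are illustrative assumptions, not the paper's exact design.

```python
import torch

def adaptive_mask(images: torch.Tensor, patch: int = 16,
                  mask_ratio: float = 0.75) -> torch.Tensor:
    """Return a (B, N) boolean mask over patches; True marks a masked patch."""
    B, C, H, W = images.shape
    # Cut the image into non-overlapping patch x patch tiles.
    tiles = images.unfold(2, patch, patch).unfold(3, patch, patch)
    tiles = tiles.contiguous().view(B, C, -1, patch * patch)  # (B, C, N, p*p)
    # Per-patch variance averaged over channels, a cheap proxy for local
    # information content (an assumption, not the paper's actual score).
    score = tiles.var(dim=-1).mean(dim=1)                     # (B, N)
    probs = torch.softmax(score, dim=-1)      # bias sampling toward texture
    n_mask = int(mask_ratio * probs.shape[1])
    idx = torch.multinomial(probs, n_mask)    # sample without replacement
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

# Example: mask 75% of the 16x16 patches in a batch of 224x224 RGB images.
mask = adaptive_mask(torch.randn(4, 3, 224, 224))
```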

📝 Abstract
Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, a cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches on most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieves mIoU scores of 78.30% and 76.50%, respectively, with only 50% of the training set. For the US3D depth estimation task, the RMSE is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieves an mIoU of 47.51%, surpassing the runner-up, CS-MAE, by 3 percentage points. Our pretraining code, checkpoints, and HR-Pairs dataset are available at https://github.com/CVEO/MSSDF.
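
The abstract's cross-modal masking mechanism is likewise not specified in detail here. The sketch below illustrates one common complementary design, where each patch chosen for masking is hidden in exactly one randomly selected modality (RGB, multispectral, or DSM) and stays visible in the others, so reconstruction must draw on cross-modal cues; the function and its parameters are assumptions for illustration, not the paper's exact rule.

```python
import torch

def cross_modal_masks(batch: int, n_patches: int, n_modalities: int = 3,
                      mask_ratio: float = 0.5,
                      generator: torch.Generator | None = None) -> list[torch.Tensor]:
    # Patches selected for masking somewhere across the modalities.
    selected = torch.rand(batch, n_patches, generator=generator) < mask_ratio
    # Assign each selected patch to exactly one modality to be hidden in,
    # so the other modalities keep it visible for cross-modal reconstruction.
    owner = torch.randint(0, n_modalities, (batch, n_patches), generator=generator)
    return [selected & (owner == m) for m in range(n_modalities)]

# Example: complementary masks for RGB, multispectral, and DSM token grids.
rgb_mask, ms_mask, dsm_mask = cross_modal_masks(batch=8, n_patches=196)
```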
Problem

Research questions and friction points this paper is trying to address.

Reducing labeled data cost for remote sensing tasks
Enhancing multi-modal image feature learning
Improving performance on diverse downstream applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal self-supervised learning framework
Information-aware adaptive masking strategy
Cross-modal masking and multi-task objectives
Tong Wang
Guanzhou Chen (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Xiaodong Zhang
Chenxi Liu
Jiaqi Wang
Xiaoliang Tan
Wenchao Guo
Qingyuan Yang
Kaiqi Zhang (Syracuse University)
Artificial Intelligence · Deep Learning