MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of high inter-class similarity, complex spatial structures, and scarcity of multi-label annotations in remote sensing imagery, this paper proposes the first unified framework integrating intra-modal and inter-modal contrastive learning. It incorporates multi-label supervision into multimodal contrastive learning by designing a novel multi-label supervised contrastive loss, enabling fine-grained semantic disentanglement and precise cross-modal alignment. The method jointly optimizes supervised contrastive learning, cross-modal alignment, and multi-label semantic modeling, using optical and SAR imagery as dual inputs. Extensive experiments on BigEarthNet V2.0 and SEN12MS demonstrate significant improvements over both fully supervised and self-supervised baselines, particularly under low-label and high-category-overlap scenarios, with substantial gains in classification accuracy and clustering consistency. These results validate the framework's strong generalization capability and superior semantic representation power.

📝 Abstract
Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and SEN12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-modal satellite imagery representation learning challenges
Overcomes high inter-class similarity and ambiguous boundaries in ESO
Enhances semantic precision in low-label multi-label remote sensing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal multi-label contrastive learning framework
Unified intra- and inter-modality contrastive optimization
Supervised contrastive loss for semantic disentanglement
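The inter-modality half of the framework aligns optical and SAR views of the same scene. A common way to realize this, shown below as an illustrative sketch rather than the paper's exact formulation, is a symmetric InfoNCE loss in which co-registered optical/SAR pairs are positives and all other pairs in the batch are negatives:

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_opt, z_sar, temperature=0.07):
    """Symmetric InfoNCE over optical/SAR embedding pairs (sketch).

    z_opt, z_sar: (N, D) embeddings; row i of each comes from the same
    geospatial patch, so (i, i) pairs are positives.
    """
    z_opt = F.normalize(z_opt, dim=1)
    z_sar = F.normalize(z_sar, dim=1)
    logits = z_opt @ z_sar.t() / temperature            # (N, N) cross-modal sims
    targets = torch.arange(z_opt.size(0), device=z_opt.device)
    # average the optical->SAR and SAR->optical directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a joint objective of the kind the paper describes, this term would be summed with the intra-modal contrastive and multi-label supervised losses, with relative weights as hyperparameters.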
Debashis Gupta
Graduate Student, Wake Forest University, NC, USA
Biomedical Engineering · Remote Sensing · Machine Learning · Computer Vision · Security Threats
Aditi Golder
Wake Forest University, NC, USA
Rongkhun Zhu
Xidian University, China
Kangning Cui
Research Assistant Professor of Computer Science, Wake Forest University
Applied Mathematics · Computational Sustainability · Medical Imaging
Wei Tang
City University of Hong Kong, Hong Kong
Fan Yang
Wake Forest University, NC, USA
Ovidiu Csillik
Wake Forest University, NC, USA
Sarra Alqahtani
Wake Forest University, NC, USA
V. Paul Pauca
Wake Forest University, NC, USA