🤖 AI Summary
To address the challenges of high inter-class similarity, complex spatial structures, and the scarcity of multi-label annotations in remote sensing imagery, this paper proposes the first unified framework to integrate intra-modal and inter-modal contrastive learning. It incorporates multi-label supervision into multimodal contrastive learning through a novel multi-label supervised contrastive loss, enabling fine-grained semantic disentanglement and precise cross-modal alignment. Taking optical and SAR imagery as dual inputs, the method jointly optimizes supervised contrastive learning, cross-modal alignment, and multi-label semantic modeling. Extensive experiments on BigEarthNet V2.0 and Sent12MS demonstrate significant improvements over both fully supervised and self-supervised baselines, particularly in low-label and high-category-overlap scenarios, with substantial gains in classification accuracy and clustering consistency. These results validate the framework's strong generalization capability and superior semantic representation power.
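The cross-modal alignment term described above can be illustrated with a generic symmetric InfoNCE objective that pulls together optical and SAR embeddings of the same scene. This is a minimal numpy sketch of that standard technique, not the paper's exact formulation; the function name, temperature value, and batch layout are assumptions for illustration.

```python
import numpy as np

def cross_modal_infonce(z_opt, z_sar, tau=0.07):
    """Symmetric InfoNCE aligning optical and SAR embeddings.
    z_opt, z_sar: (N, D) embeddings where row i of each modality
    comes from the same geospatial scene. Generic sketch only."""
    # L2-normalize so dot products are cosine similarities
    z_opt = z_opt / np.linalg.norm(z_opt, axis=1, keepdims=True)
    z_sar = z_sar / np.linalg.norm(z_sar, axis=1, keepdims=True)
    logits = z_opt @ z_sar.T / tau  # (N, N); matched pairs on the diagonal

    def ce(l):
        # row-wise cross-entropy against the diagonal (stable log-softmax)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    # average over optical-to-SAR and SAR-to-optical directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Perfectly aligned modalities drive this loss toward zero, while mismatched pairs keep it near log N, which is why it serves as the cross-modal alignment signal.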
📝 Abstract
Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning, especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
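The multi-label supervised contrastive loss mentioned in the abstract can be sketched by extending SupCon-style supervision to multi-hot labels: a common choice is to treat any pair of samples sharing at least one label as positives. The numpy sketch below illustrates that idea under this assumption; the paper's actual loss may weight positives differently, and the function name and signature are hypothetical.

```python
import numpy as np

def multilabel_supcon_loss(z, labels, tau=0.1):
    """Sketch of a multi-label supervised contrastive loss.
    z: (N, D) embeddings; labels: (N, C) multi-hot label matrix.
    Positives = pairs sharing at least one class (an assumption)."""
    # L2-normalize embeddings and compute temperature-scaled similarities
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs from the softmax

    # positive mask: samples whose label sets overlap
    pos = (labels @ labels.T) > 0
    np.fill_diagonal(pos, False)

    # log-softmax over each row (self-pair contributes zero via -inf)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # average negative log-probability over each anchor's positives,
    # skipping anchors that have no positive in the batch
    counts = pos.sum(axis=1)
    valid = counts > 0
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / counts[valid]
    return -per_anchor.mean()
```

Because positives are defined by label overlap rather than a single class, spectrally similar scenes that share land-cover categories are pulled together while fully disjoint scenes are pushed apart, which matches the fine-grained disentanglement goal described above.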