🤖 AI Summary
To address the challenges of high inter-class similarity, complex spatial structures, and the scarcity of multi-label annotations in remote sensing imagery, this paper proposes the first unified framework to integrate intra-modal and inter-modal contrastive learning. It incorporates multi-label supervision into multimodal contrastive learning through a novel multi-label supervised contrastive loss, enabling fine-grained semantic disentanglement and precise cross-modal alignment. Taking optical and SAR imagery as dual inputs, the method jointly optimizes supervised contrastive learning, cross-modal alignment, and multi-label semantic modeling. Extensive experiments on BigEarthNet V2.0 and Sent12MS demonstrate significant improvements over both fully supervised and self-supervised baselines, particularly in low-label and high-category-overlap scenarios, with substantial gains in classification accuracy and clustering consistency. These results validate the framework's strong generalization capability and superior semantic representation power.
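The cross-modal alignment term described above can be illustrated with a generic symmetric InfoNCE objective that pulls together optical and SAR embeddings of the same scene. This is a minimal numpy sketch of that standard technique, not the paper's exact formulation; the function name, temperature value, and batch layout are assumptions for illustration.

```python
import numpy as np

def cross_modal_infonce(z_opt, z_sar, tau=0.07):
    """Symmetric InfoNCE aligning optical and SAR embeddings.
    z_opt, z_sar: (N, D) embeddings where row i of each modality
    comes from the same geospatial scene. Generic sketch only."""
    # L2-normalize so dot products are cosine similarities
    z_opt = z_opt / np.linalg.norm(z_opt, axis=1, keepdims=True)
    z_sar = z_sar / np.linalg.norm(z_sar, axis=1, keepdims=True)
    logits = z_opt @ z_sar.T / tau  # (N, N); matched pairs on the diagonal

    def ce(l):
        # row-wise cross-entropy against the diagonal (stable log-softmax)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    # average over optical-to-SAR and SAR-to-optical directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Perfectly aligned modalities drive this loss toward zero, while mismatched pairs keep it near log N, which is why it serves as the cross-modal alignment signal.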
📝 Abstract
Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning, especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
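The multi-label supervised contrastive loss mentioned in the abstract can be sketched by extending SupCon-style supervision to multi-hot labels: a common choice is to treat any pair of samples sharing at least one label as positives. The numpy sketch below illustrates that idea under this assumption; the paper's actual loss may weight positives differently, and the function name and signature are hypothetical.

```python
import numpy as np

def multilabel_supcon_loss(z, labels, tau=0.1):
    """Sketch of a multi-label supervised contrastive loss.
    z: (N, D) embeddings; labels: (N, C) multi-hot label matrix.
    Positives = pairs sharing at least one class (an assumption)."""
    # L2-normalize embeddings and compute temperature-scaled similarities
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs from the softmax

    # positive mask: samples whose label sets overlap
    pos = (labels @ labels.T) > 0
    np.fill_diagonal(pos, False)

    # log-softmax over each row (self-pair contributes zero via -inf)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # average negative log-probability over each anchor's positives,
    # skipping anchors that have no positive in the batch
    counts = pos.sum(axis=1)
    valid = counts > 0
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / counts[valid]
    return -per_anchor.mean()
```

Because positives are defined by label overlap rather than a single class, spectrally similar scenes that share land-cover categories are pulled together while fully disjoint scenes are pushed apart, which matches the fine-grained disentanglement goal described above.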