CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation

📅 2024-05-02

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address large-scale noisy labels and underutilized multimodal data in remote sensing semantic segmentation, this paper proposes the first clean-label-free cross-modal self-supervised pretraining framework. Using paired Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery, it introduces a cross-modal probabilistic consistency mechanism and an entropy- and confidence-guided filtering strategy that dynamically selects reliable labels for robust learning. The framework is pretrained on SSL4EO-S12 imagery paired with nine-class noisy annotations from Google Dynamic World and evaluated on the DFC2020 downstream task. Compared to unimodal and conventional noisy-label learning approaches, the method achieves a 4.2% improvement in mean Intersection-over-Union (mIoU), demonstrating the effectiveness and generalizability of cross-modal consistency modeling and noise-robust pretraining.
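For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mIoU is commonly computed from a predicted and a reference label map. The function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration, not details taken from the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps (assumed convention: skip it)
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average, 7/12.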

📝 Abstract

We study the potential of noisy labels y to pretrain semantic segmentation models in a multi-modal learning framework for geospatial applications. Specifically, we propose a novel Cross-modal Sample Selection method (CromSS) that utilizes the class distributions P^{(d)}(x,c) over pixels x and classes c modelled by multiple sensors/modalities d of a given geospatial scene. Consistency of predictions across sensors d is jointly informed by the entropy of P^{(d)}(x,c). We determine noisy label sampling by the confidence of each sensor d in the noisy class label, P^{(d)}(x,c=y(x)). To verify the performance of our approach, we conduct experiments with Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the globally sampled SSL4EO-S12 dataset. We pair those scenes with 9-class noisy labels sourced from the Google Dynamic World project for pretraining. Transfer learning evaluations (downstream task) on the DFC2020 dataset confirm the effectiveness of the proposed method for remote sensing image segmentation.
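The two quantities named in the abstract, the entropy of P^{(d)}(x,c) and the confidence P^{(d)}(x,c=y(x)), can be illustrated with a short NumPy sketch. The helper names, the `keep_ratio` parameter, and the rule that combines sensors (taking the minimum confidence across sensors and keeping the top fraction of pixels) are hypothetical choices for illustration only; the abstract does not specify how CromSS combines these scores.

```python
import numpy as np

def pixel_entropy(probs, eps=1e-12):
    """Shannon entropy over classes; probs has shape (C, H, W)."""
    return -(probs * np.log(probs + eps)).sum(axis=0)

def label_confidence(probs, noisy_labels):
    """P(x, c=y(x)): probability each pixel assigns to its noisy label."""
    # noisy_labels is an integer array of shape (H, W).
    return np.take_along_axis(probs, noisy_labels[None], axis=0)[0]

def select_pixels(probs_by_sensor, noisy_labels, keep_ratio=0.5):
    """Hypothetical rule: keep pixels whose label every sensor trusts most."""
    # Worst-case confidence across sensors d (an assumed combination rule).
    conf = np.min(
        [label_confidence(p, noisy_labels) for p in probs_by_sensor], axis=0
    )
    threshold = np.quantile(conf, 1.0 - keep_ratio)
    return conf >= threshold

# Toy scene: 2 classes, 1x2 pixels, two sensors (e.g. radar and optical).
probs_s1 = np.array([[[0.9, 0.5]], [[0.1, 0.5]]])
probs_s2 = np.array([[[0.8, 0.4]], [[0.2, 0.6]]])
noisy_labels = np.array([[0, 0]])

entropy_s1 = pixel_entropy(probs_s1)
mask = select_pixels([probs_s1, probs_s2], noisy_labels, keep_ratio=0.5)
```

In this toy scene the first pixel is kept because both sensors assign high probability to its noisy label, while the second pixel, where the first sensor is maximally uncertain, is dropped.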

Problem

Research questions and friction points this paper is trying to address.

Enhance feature learning using noisy labels for segmentation.

Improve semantic segmentation with cross-modal consistency techniques.

Mitigate label noise effects in multi-modal pretraining frameworks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Sample Selection (CromSS) method

Cross-modal entangling strategy for noise mitigation

Spatial-temporal label smoothing technique

🔎 Similar Papers

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation