Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images

📅 2026-01-07

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in deep learning for satellite image time series (SITS): existing models typically process whole images or complete sequences and do not model pixel-level temporal dynamics well. The authors propose a pixel-level multimodal contrastive learning framework that converts per-pixel vegetation index time series (NDVI, EVI, and SAVI) into two-dimensional recurrence plots and combines them with the original remote sensing images to form a multimodal input. A self-supervised contrastive learning strategy jointly optimizes the feature representations of both modalities. The method is evaluated on pixel-level forecasting and classification on PASTIS and on land cover classification on EuroSAT, where the authors report that it outperforms current state-of-the-art approaches, a result they attribute to the recurrence plot representation and the multimodal self-supervised training.
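As an illustration of the recurrence plot idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It uses the standard NDVI, EVI, and SAVI formulas and an unthresholded recurrence plot (pairwise absolute differences between time steps). The function names, the SAVI soil factor `L=0.5`, the optional `threshold` parameter, and the toy reflectance values are all assumptions made for this example.

```python
import numpy as np

def vegetation_indices(nir, red, blue, L=0.5):
    """Standard NDVI, EVI, and SAVI from reflectance time series in [0, 1].

    L is the SAVI soil-adjustment factor (0.5 is a common default, assumed here).
    """
    nir, red, blue = (np.asarray(b, dtype=float) for b in (nir, red, blue))
    eps = 1e-8  # guards against division by zero for dark pixels
    ndvi = (nir - red) / (nir + red + eps)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    savi = (1.0 + L) * (nir - red) / (nir + red + L)
    return ndvi, evi, savi

def recurrence_plot(series, threshold=None):
    """Turn a 1D series of length T into a T x T recurrence plot.

    With threshold=None the raw distance matrix |x_i - x_j| is returned;
    otherwise entries are binarized to 1 where the distance is <= threshold.
    """
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])  # pairwise distances via broadcasting
    if threshold is None:
        return dist
    return (dist <= threshold).astype(np.uint8)

# Toy reflectance series for a single pixel over five acquisition dates.
nir = np.array([0.30, 0.45, 0.60, 0.50, 0.35])
red = np.array([0.20, 0.15, 0.10, 0.12, 0.18])
blue = np.array([0.10, 0.08, 0.06, 0.07, 0.09])

ndvi, evi, savi = vegetation_indices(nir, red, blue)

# Stack one plot per index to get an image-like array of shape (3, 5, 5).
pixel_image = np.stack([recurrence_plot(v) for v in (ndvi, evi, savi)])
```

Stacking the three plots as channels gives each pixel an image-like input that a 2D encoder can consume; whether the paper stacks, thresholds, or normalizes the plots this way is not stated in the summary.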

📝 Abstract

Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
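The abstract does not spell out the PIMC objective, so the following is only a generic sketch of cross-modal contrastive learning: a symmetric InfoNCE loss (in the style popularized by CLIP) between embeddings of recurrence plots and embeddings of the matching image patches. The function name, the temperature value, and the random stand-in embeddings are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def _log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(z_plot, z_image, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row i of z_plot and row i of z_image are treated as a positive pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_plot = z_plot / np.linalg.norm(z_plot, axis=1, keepdims=True)
    z_image = z_image / np.linalg.norm(z_image, axis=1, keepdims=True)

    logits = z_plot @ z_image.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(z_plot))

    # Cross-entropy with the diagonal as the target, in both directions.
    loss_plot_to_image = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_image_to_plot = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_plot_to_image + loss_image_to_plot)

# Stand-in embeddings: in practice these would come from the two encoders.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))

loss_aligned = symmetric_info_nce(z, z)                        # matching pairs
loss_shuffled = symmetric_info_nce(z, np.roll(z, 1, axis=0))   # mismatched pairs
```

With aligned pairs the diagonal of the similarity matrix dominates and the loss is small; rolling one batch breaks the pairing and the loss rises, which is the signal that pushes both encoders toward a shared representation.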

Problem

Research questions and friction points this paper is trying to address.

remote sensing images

satellite image time series

pixel-wise representation

multimodal learning

self-supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

pixel-wise contrastive learning

multimodal self-supervision

recurrence plots