COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of cross-modal representation learning and zero-shot translation across multimodal remote sensing data (optical, SAR, and elevation), this paper proposes a sequence-based diffusion Transformer architecture. The method equips the diffusion process with modality-specific timestep embeddings, so that the denoising schedule of each modality can be controlled independently. Its core contributions are: (1) the first framework enabling zero-shot translation from any subset of input modalities to any target modality, without training a separate model for each translation direction; and (2) a timestep-embedding design that disentangles modality-specific dynamics within the diffusion denoising process. Trained on the Major TOM dataset, the model achieves high-fidelity cross-modal generation on Copernicus thumbnail imagery. Quantitative and qualitative evaluations demonstrate substantial improvements in cross-modal consistency and generalization, while maintaining robust generation under missing-modality conditions, establishing a new paradigm for unified modelling of heterogeneous remote sensing sources.
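To make the architecture concrete, here is a minimal PyTorch sketch of a sequence-based diffusion Transformer in which each modality's token block carries its own timestep embedding. All names (MultiModalDiT, timestep_embedding) and hyperparameters are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: a joint token sequence over all modalities, where every
# modality receives its OWN diffusion timestep embedding. This is what lets
# any subset act as clean conditioning (t = 0) while the rest are denoised.
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of (per-modality) diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class MultiModalDiT(nn.Module):
    def __init__(self, n_modalities: int, dim: int = 512, depth: int = 8, heads: int = 8):
        super().__init__()
        # Learned embedding marking which modality each token belongs to.
        self.modality_embed = nn.Embedding(n_modalities, dim)
        self.time_proj = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)  # predicts noise per token

    def forward(self, tokens: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, n_modalities, tokens_per_modality, dim) noisy latents
        # timesteps: (B, n_modalities) -- one timestep per modality, not one global t
        B, M, N, D = tokens.shape
        t_emb = self.time_proj(timestep_embedding(timesteps.reshape(-1), D)).view(B, M, 1, D)
        m_emb = self.modality_embed(torch.arange(M, device=tokens.device)).view(1, M, 1, D)
        x = (tokens + t_emb + m_emb).reshape(B, M * N, D)  # one joint sequence
        return self.head(self.backbone(x)).view(B, M, N, D)
```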

📝 Abstract
In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.
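The per-modality timesteps also suggest how zero-shot translation can work at sampling time: observed modalities are held at timestep 0 (treated as clean conditioning) while target modalities are denoised from noise. The sketch below illustrates this idea with a deliberately simplified denoising update; the function name, loop structure, and update rule are assumptions, not the paper's sampler.

```python
# Hedged sketch of zero-shot modality translation with a model like the one
# above: conditioning modalities stay clean at t = 0, targets are denoised.
# The update step is a naive stand-in for a proper DDPM/DDIM scheduler.
import torch

@torch.no_grad()
def translate(model, observed, observed_ids, target_ids, shape, n_steps=50):
    """observed: dict {modality_id: (B, N, D) clean latent tokens}."""
    B, M, N, D = shape
    tokens = torch.randn(B, M, N, D)             # start targets from pure noise
    for i in observed_ids:
        tokens[:, i] = observed[i]               # plug in conditioning modalities
    for step in reversed(range(1, n_steps + 1)):
        t = torch.zeros(B, M, dtype=torch.long)  # conditioning stays at t = 0
        for j in target_ids:
            t[:, j] = step                       # only targets carry a real timestep
        eps = model(tokens, t)                   # predict noise for every token
        for j in target_ids:
            tokens[:, j] = tokens[:, j] - eps[:, j] / n_steps  # simplified update
        for i in observed_ids:
            tokens[:, i] = observed[i]           # re-clamp conditioning each step
    return {j: tokens[:, j] for j in target_ids}
```

The point of the sketch is that conditioning is expressed entirely through the per-modality timestep vector rather than a separate encoder, which is why any subset-to-subset combination works with one trained model.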
Problem

Research questions and friction points this paper is trying to address.

Learning a unified representation across multi-modal remote sensing data
Enabling zero-shot modality translation for diverse sensor inputs
Overcoming the limitations of traditional single- and dual-modality approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative diffusion model jointly modelling optical, radar, and elevation data
Zero-shot translation between any subset of modalities and any other
Sequence-based diffusion Transformer with per-modality timestep embeddings
🔎 Similar Papers
No similar papers found.
Authors
Miguel Espinosa (University of Edinburgh)
V. Marsocci (European Space Agency, ESA)
Yuru Jia (KU Leuven, KTH)
Elliot Crowley (University of Edinburgh)
Mikolaj Czerkawski (Partner Scientist, Asterisk Labs)