MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the scarcity of paired multimodal remote sensing data and the limited modality scalability and task generalization of existing pairwise translation methods. The authors propose a scene-centered joint modeling paradigm: a decoupled architecture first infers a latent scene representation from the available observations, then generates target modalities conditioned on that representation. This enables paired joint generation and any-to-any translation across five remote sensing modalities within a unified model. To support training, they construct EarthMM, a large-scale dataset of 2.8 million multi-resolution global images with 2.2 million aligned pairs. Experiments show strong generalization across diverse generation tasks, and the model also supports downstream tasks at both the data and representation levels.
📝 Abstract
Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.
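The two-stage idea in the abstract (infer a latent scene representation from whatever observations are available, then generate target modalities conditioned on it) can be illustrated with a toy sketch. This is not the authors' architecture: the modality names, the linear encoders and decoders, the mean fusion, and the sizes are all assumptions chosen only to show the control flow. The last lines count how many directed pairwise translators five modalities would otherwise require.

```python
import random
from itertools import permutations

# Hypothetical modality names; the source only states that there are five.
MODALITIES = ["mod_a", "mod_b", "mod_c", "mod_d", "mod_e"]
LATENT_DIM = 4  # toy latent size (assumption)
PIXELS = 6      # toy flattened "image" length (assumption)


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of random weights as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SceneCenteredToy:
    """Toy stand-in: one encoder and one decoder per modality around a shared latent."""

    def __init__(self, modalities, seed=0):
        rng = random.Random(seed)
        self.encoders = {m: random_matrix(LATENT_DIM, PIXELS, rng) for m in modalities}
        self.decoders = {m: random_matrix(PIXELS, LATENT_DIM, rng) for m in modalities}

    def infer_scene(self, observations):
        """Stage 1: fuse the available modalities into one scene latent (simple mean)."""
        if not observations:
            raise ValueError("at least one observed modality is required")
        latents = [matvec(self.encoders[m], x) for m, x in observations.items()]
        return [sum(values) / len(latents) for values in zip(*latents)]

    def generate(self, scene, targets):
        """Stage 2: produce each requested modality from the scene latent."""
        return {m: matvec(self.decoders[m], scene) for m in targets}


model = SceneCenteredToy(MODALITIES)
observed = {"mod_a": [0.1] * PIXELS, "mod_c": [0.5] * PIXELS}
scene = model.infer_scene(observed)
outputs = model.generate(scene, ["mod_b", "mod_e"])

# Isolated pairwise translation needs one model per ordered (source, target) pair.
pairwise_directions = len(list(permutations(MODALITIES, 2)))
print(pairwise_directions)  # 20 directed pairs for five modalities
```

In this sketch, any subset of inputs maps to any subset of outputs through the same latent, so adding a sixth modality would add one encoder and one decoder rather than ten new directed translators.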
Problem

Research questions and friction points this paper is trying to address.

multimodal remote sensing
paired image generation
cross-modal translation
scene consistency
foundation model
Innovation

Methods, ideas, or system contributions that make the work stand out.

scene-centered joint modeling
multimodal remote sensing
generative foundation model
any-to-any translation
latent scene representation
Zhiping Yu
Beihang University
deep learning, remote sensing, AIGC
Chenyang Liu
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University, Beijing 100191, China and the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
Jinqi Cao
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University, Beijing 100191, China and the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
Qinzhe Yang
Shenyuan Honors College, Beihang University, Beijing 100191, China
Siwei Yu
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University, Beijing 100191, China and the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
Zhengxia Zou
Beihang University
computer vision, image processing, remote sensing, games
Zhenwei Shi
Professor at Image Processing Center, Beihang University, China
Hyperspectral imaging, Remote Sensing, Signal and Image Processing, Pattern Recognition, Machine Learning