GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

📅 2026-04-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses a limitation of existing remote sensing foundation models: the scarcity of large-scale, spatially aligned multimodal data and of semantically reliable supervisory signals. The authors introduce GeoMeld, a remote sensing dataset of approximately 2.5 million spatially aligned samples constructed under a unified alignment protocol, along with the GeoMeld-FM pretraining framework. The dataset's captions come from an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata. GeoMeld-FM jointly optimizes multi-pretext masked autoencoding, Joint-Embedding Predictive Architecture (JEPA) representation learning, and image-text contrastive alignment during pretraining. The reported experiments show consistent gains in downstream transfer and cross-sensor robustness.
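The synthesize-then-verify idea behind the captioning step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the sample layout, the NDVI and relief thresholds, and the caption template are hypothetical and are not taken from the paper, which does not describe its agent at this level of detail.

```python
from statistics import mean

def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return (nir - red) / total if total else 0.0

def draft_caption(sample):
    """Synthesize a caption from measurable quantities.

    Thresholds (0.5 NDVI, 100 m relief) are illustrative assumptions.
    """
    mean_ndvi = mean(ndvi(n, r) for n, r in zip(sample["nir"], sample["red"]))
    relief_m = max(sample["elevation_m"]) - min(sample["elevation_m"])
    cover = "dense vegetation" if mean_ndvi > 0.5 else "sparse vegetation"
    terrain = "hilly terrain" if relief_m > 100 else "flat terrain"
    stats = {"mean_ndvi": mean_ndvi, "relief_m": relief_m}
    return f"{cover} on {terrain} in {sample['region']}", stats

def verify_caption(caption, stats):
    """Reject a caption whose claims contradict the measured statistics."""
    dense = stats["mean_ndvi"] > 0.5
    hilly = stats["relief_m"] > 100
    if ("dense vegetation" in caption) != dense:
        return False
    if ("hilly terrain" in caption) != hilly:
        return False
    return True

# Hypothetical sample: per-pixel reflectances, elevations, and a metadata field.
sample = {
    "nir": [0.8, 0.7, 0.9],
    "red": [0.1, 0.2, 0.1],
    "elevation_m": [120, 135, 128],
    "region": "Region A",
}
caption, stats = draft_caption(sample)
accepted = verify_caption(caption, stats)
```

The point of the separate verification pass is that a caption is kept only when each of its claims can be checked against a quantity computed from the aligned modalities.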


📝 Abstract
Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
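As a rough illustration of how the three pretraining terms named above might be combined, the following sketch computes a masked-reconstruction loss, a latent-prediction (JEPA-style) loss, and a one-directional image-to-text contrastive loss, then sums them with weights. The function names, the equal default weights, the temperature value, and the use of plain Python lists are all assumptions for this example; the abstract does not specify the actual loss formulation or weighting.

```python
import math

def masked_mse(pred, target, mask):
    """MAE-style term: mean squared error over masked positions only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

def latent_mse(pred_emb, target_emb):
    """JEPA-style term: error between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred_emb, target_emb)) / len(pred_emb)

def _unit(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(img_embs, txt_embs, temperature=0.07):
    """Image-to-text contrastive term; pair i is the positive for image i."""
    imgs = [_unit(v) for v in img_embs]
    txts = [_unit(v) for v in txt_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(imgs)

def joint_loss(l_mae, l_jepa, l_con, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_mae, w_jepa, w_con = weights
    return w_mae * l_mae + w_jepa * l_jepa + w_con * l_con

# Toy values only, to show how the terms fit together.
l_mae = masked_mse([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0, 1, 1])
l_jepa = latent_mse([1.0, 0.0], [0.0, 0.0])
l_con = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
total = joint_loss(l_mae, l_jepa, l_con)
```

A full contrastive objective is usually symmetric (image-to-text plus text-to-image); only one direction is shown here to keep the sketch short.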
Problem

Research questions and friction points this paper is trying to address.

foundation models
remote sensing
multimodal alignment
semantic grounding
large-scale dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantically grounded
multimodal alignment
foundation model
remote sensing
agentic captioning