GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing geospatial foundation models (GeoFMs) over-rely on overhead remote sensing (RS) data, lack a unified multimodal representation that covers ground-level imagery (e.g., street views), and do not explicitly model cross-modal geospatial relationships, which limits their generalization across tasks, spatial scales, and temporal contexts. To address this, the paper proposes GAIR, a multimodal GeoFM that jointly models RS imagery, street-view (SV) images, and geographic coordinates. The method combines a decoupled triple-encoder architecture, a geographic coordinate embedding, an implicit neural representation (INR) module that continuously and precisely localizes street views within the RS image's geospatial reference frame, and unsupervised cross-modal contrastive learning. Evaluated on ten diverse tasks spanning RS analysis, street-view understanding, and location-aware prediction, the model consistently surpasses state-of-the-art methods, with significant gains in cross-scale, cross-temporal, and cross-task transfer, demonstrating enhanced generalization across heterogeneous geospatial modalities.
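The summary mentions three factorized encoders, one of which embeds geographic coordinates. Below is a minimal PyTorch sketch of one common way to do this, assuming a multi-scale sinusoidal scheme; the class name, dimensions, and frequency ladder are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class SinusoidalLocationEncoder(nn.Module):
    """Maps (lon, lat) pairs to a dense embedding via multi-scale
    sinusoidal features. A generic positional-encoding scheme, not
    necessarily the encoder used in the paper."""

    def __init__(self, num_scales: int = 16, dim: int = 256):
        super().__init__()
        # geometric ladder of frequencies: 1, 2, 4, ..., 2^(S-1)
        self.register_buffer("freqs", 2.0 ** torch.arange(num_scales))
        # 2 coords x (sin, cos) x num_scales features -> projection to dim
        self.proj = nn.Sequential(
            nn.Linear(4 * num_scales, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, lonlat: torch.Tensor) -> torch.Tensor:
        # lonlat: (B, 2), coordinates pre-normalized to [-1, 1]
        x = lonlat.unsqueeze(-1) * self.freqs          # (B, 2, S)
        feats = torch.cat([x.sin(), x.cos()], dim=-1)  # (B, 2, 2S)
        return self.proj(feats.flatten(1))             # (B, dim)
```

The SV and RS branches would analogously be image encoders projecting into the same embedding dimension.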

📝 Abstract
Advancements in vision and language foundation models have inspired the development of geo-foundation models (GeoFMs), enhancing performance across diverse geospatial tasks. However, many existing GeoFMs primarily focus on overhead remote sensing (RS) data while neglecting other data modalities such as ground-level imagery. A key challenge in multimodal GeoFM development is to explicitly model geospatial relationships across modalities, which enables generalizability across tasks, spatial scales, and temporal contexts. To address these limitations, we propose GAIR, a novel multimodal GeoFM architecture integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. We utilize three factorized neural encoders to project an SV image, its geolocation, and an RS image into the embedding space. The SV image needs to be located within the RS image's spatial footprint but does not need to be at its geographic center. To geographically align the SV and RS images, we propose a novel implicit neural representation (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image's geolocation. Next, the geographically aligned SV, RS, and location embeddings are trained with contrastive learning objectives on unlabeled data. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and other strong baselines, highlighting its effectiveness in learning generalizable and transferable geospatial representations.
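The abstract's key mechanism is an INR that treats the RS representation as a continuous field and queries it at the SV image's geolocation. One simple way to realize such a look-up, sketched below in PyTorch, interpolates the RS encoder's spatial feature map at the SV's normalized offset and decodes it with a coordinate-conditioned MLP; the module name and parameterization are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSFeatureLookup(nn.Module):
    """Decodes a continuous RS representation at a query location.

    Sketch only: the RS encoder's patch features are treated as samples
    of a continuous field; the feature interpolated at the SV image's
    offset inside the RS footprint is refined by a coordinate-conditioned
    MLP. The paper's actual INR module may be parameterized differently.
    """

    def __init__(self, feat_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 2, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, out_dim),
        )

    def forward(self, rs_feats: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
        # rs_feats: (B, C, H, W) spatial feature map from the RS encoder
        # xy: (B, 2) SV location within the RS tile, (x, y) in [-1, 1]
        grid = xy.view(-1, 1, 1, 2)                    # grid_sample layout
        sampled = F.grid_sample(
            rs_feats, grid, mode="bilinear", align_corners=False
        ).view(rs_feats.size(0), -1)                   # (B, C)
        # condition the decoder on the query coordinate as well
        return self.decoder(torch.cat([sampled, xy], dim=-1))
```

The resulting location-specific RS embedding is what gets contrasted against the SV and location embeddings during pre-training.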
Problem

Research questions and friction points this paper is trying to address.

Existing GeoFMs focus primarily on overhead RS data, neglecting ground-level modalities such as street view imagery
Geospatial relationships across modalities are not explicitly modeled, leaving cross-modal alignment unresolved
Generalizability across tasks, spatial scales, and temporal contexts remains limited
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates overhead RS imagery, street view imagery, and geolocation metadata via three factorized encoders
Geographically aligns SV and RS embeddings with an INR module that learns a continuous RS representation
Trains with cross-modal contrastive learning on unlabeled data (see the sketch below)
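A CLIP-style symmetric InfoNCE objective is the standard way to realize such cross-modal contrastive training; how GAIR weights the pairwise terms across the three embeddings is not stated in this summary, so the combination shown is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings, shape (B, D).
    Matching pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical combination over the three geographically aligned
# embeddings (the paper's pair weighting is not given in this summary):
# loss = (info_nce(sv_emb, rs_emb)
#         + info_nce(sv_emb, loc_emb)
#         + info_nce(rs_emb, loc_emb))
```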