GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

📅 2025-12-02
🤖 AI Summary
Traditional satellite-centric cross-view geo-localization methods lack robustness when high-resolution satellite imagery is unavailable, and they struggle to jointly exploit multi-view (e.g., UAV, street-view, satellite) and multimodal (e.g., image, text) cues. To address these limitations, this paper proposes a semantic-anchoring mechanism that breaks away from the exclusively satellite-centric paradigm and enables bidirectional image–text cross-modal matching. Built on a Transformer architecture, the approach integrates multi-view contrastive learning, semantic-alignment pretraining, and cross-modal feature fusion. The authors further introduce GeoLoc, the first large-scale multimodal aligned dataset for geo-localization. Experiments demonstrate substantial improvements in localization accuracy under low-resolution satellite conditions, along with enhanced cross-domain generalization and cross-modal transfer capability. Both the model and the GeoLoc dataset are publicly released.

📝 Abstract
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, multi-view aligned dataset, comprising over 50,000 pairs of drone, street-view panorama, and satellite images together with their textual descriptions, collected from 36 countries and ensuring both geographic and semantic alignment. We perform broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-localization accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models are released at https://github.com/MiliLab/GeoBridge.
Problem

Research questions and friction points this paper is trying to address.

Satellite-centric pipelines degrade when high-resolution or up-to-date satellite imagery is unavailable
Complementary cues across views (drone, street, satellite) and modalities (image, text) are underexploited
Existing formulations lack bidirectional cross-view matching and language-to-image retrieval for localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-anchor mechanism bridges multi-view features
Bidirectional matching across views and language-to-image retrieval
Large-scale cross-modal multi-view aligned dataset for pre-training
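The semantic-anchor idea above, aligning each view-specific embedding (drone, street, satellite) to a shared textual description rather than to a fixed satellite reference, can be sketched as a symmetric InfoNCE objective between each view and the text anchor. The sketch below is illustrative only: the function names, embedding sizes, and temperature are assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def semantic_anchor_loss(view_embeddings, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE pulling every view toward the shared text anchor.

    Matched view/text pairs sit on the diagonal of each logit matrix, so the
    text description acts as the bridge between otherwise disjoint views.
    """
    text = l2_normalize(text_embeddings)
    labels = np.arange(len(text))
    total = 0.0
    for view in view_embeddings:
        v = l2_normalize(view)
        logits = v @ text.T / temperature
        # Average view-to-text and text-to-view directions (bidirectional matching).
        total += 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
    return total / len(view_embeddings)

# Toy usage: three "views" of a batch of 8 locations, 16-dim embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
views = [text + 0.1 * rng.normal(size=(8, 16)) for _ in range(3)]
print(semantic_anchor_loss(views, text))
```

Because every view is tied to the same text anchor, any pair of views becomes comparable through that shared space, which is what enables retrieval without a high-resolution satellite reference.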