Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited geographic coverage and insufficient scene diversity of existing cross-modal geo-localization methods, which hinder their generalization to a global scale. To this end, the authors introduce CORE, the first large-scale cross-view image-text dataset, spanning 225 regions worldwide with over one million samples, and propose PLANET, a physics-aware network that integrates physical consistency constraints with contrastive learning. Leveraging large vision-language models (LVLMs), PLANET generates high-quality scene descriptions in a zero-shot manner to enable effective cross-modal alignment between textual queries and satellite imagery. Experimental results demonstrate that PLANET significantly outperforms current state-of-the-art methods across multiple regions, establishing a new benchmark for global-scale cross-modal geo-localization.

📝 Abstract
Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.
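Cross-modal alignment of the kind described in the abstract is typically trained with a symmetric contrastive (InfoNCE) objective, as popularized by CLIP. The paper's exact loss and physical-consistency terms are not given on this page; the sketch below is only a generic illustration of the base contrastive paradigm, assuming L2-normalized text and satellite-image embeddings and a hypothetical `temperature` hyperparameter:

```python
import numpy as np

def symmetric_contrastive_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    text_emb, img_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (N, N): similarity of every text to every image
    labels = np.arange(len(t))          # the matched pair sits on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # average of text-to-image and image-to-text retrieval losses
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

At retrieval time the same similarity matrix ranks candidate satellite tiles for a text query; any physical-consistency constraint would be an additional term on top of this objective.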
Problem

Research questions and friction points this paper is trying to address.

Cross-modal Geo-localization
Global-scale Dataset
Spatial Heterogeneity
Aerial Imagery
Ground-level Text
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal geo-localization
million-scale dataset
physical consistency learning
large vision-language models
contrastive learning
Yutong Hu
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Jinhui Chen
Wakayama University
machine learning, speech processing, auditory perception, image processing
Chaoqiang Xu
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Yuan Kou
First Surveying and Mapping Institute of Hunan Province, Changsha 421001, China
Sili Zhou
First Surveying and Mapping Institute of Hunan Province, Changsha 421001, China
Shaocheng Yan
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Pengcheng Shi
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Qingwu Hu
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Jiayuan Li
Wuhan University
remote sensing, image processing, computer vision