GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

📅 2025-11-17
🤖 AI Summary
This work systematically investigates the capabilities of large multimodal models (LMMs) on cross-view geolocalization and pose estimation, a critical yet underexplored navigation-level spatial perception task. To this end, we introduce GeoX-Bench, the first dedicated benchmark comprising 10,859 panoramic-satellite image pairs spanning 128 cities across 49 countries, along with 755,976 structured question-answer samples. Leveraging precise geographic registration and a standardized evaluation protocol, we comprehensively assess 25 state-of-the-art LMMs. Results reveal robust performance in coarse geolocalization but substantial limitations in fine-grained pose estimation (e.g., heading and pitch angles). To address this gap, we propose a spatial-aware instruction-tuning strategy that significantly enhances cross-view geometric reasoning. This study fills a critical void in LMM evaluation for navigation-grade spatial understanding, providing both a reproducible benchmark and a methodological framework for capability diagnosis and targeted improvement.

📝 Abstract
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore how instruction-tuning improves these capabilities. Our benchmark results demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.
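
The split sizes quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable names are illustrative, and the derived figures (training-split size, benchmark share, mean QA pairs per image pair) are computed here rather than reported by the authors.

```python
# Counts quoted in the abstract.
TOTAL_QA_PAIRS = 755_976
BENCHMARK_QA_PAIRS = 42_900
IMAGE_PAIRS = 10_859

# Derived values (not stated in the abstract): the abstract only says the
# remaining QA pairs are intended for enhancing LMMs, so the remainder is
# treated here as the training split.
training_qa_pairs = TOTAL_QA_PAIRS - BENCHMARK_QA_PAIRS
benchmark_share = BENCHMARK_QA_PAIRS / TOTAL_QA_PAIRS
qa_per_image_pair = TOTAL_QA_PAIRS / IMAGE_PAIRS

print(f"training QA pairs: {training_qa_pairs:,}")
print(f"benchmark share of all QA pairs: {benchmark_share:.1%}")
print(f"mean QA pairs per image pair: {qa_per_image_pair:.1f}")
```

Under these assumptions, roughly one QA pair in eighteen is held out for benchmarking, and the rest is available for instruction-tuning.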
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' cross-view geo-localization capabilities using panoramic-satellite image pairs
Assessing pose estimation abilities of large multimodal models across diverse geographic locations
Benchmarking 25 state-of-the-art LMMs on complex geospatial reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoX-Bench benchmark for cross-view geo-localization
Evaluates 25 LMMs using panoramic-satellite image pairs
Instruction-tuning enhances cross-view geo-sense abilities
Authors

Yushuo Zheng, Shanghai Jiao Tong University
Jiangyong Ying, China Telecom
Huiyu Duan, Shanghai Jiao Tong University (Multimedia Signal Processing)
Chunyi Li, NTU | SJTU | Shanghai AI Lab (Generative AI, Embodied AI, Low-level Vision)
Zicheng Zhang, Shanghai Jiao Tong University
Jing Liu, Tianjin University
Xiaohong Liu, Shanghai Jiao Tong University Sichuan Research Institute
Guangtao Zhai, Professor, IEEE Fellow, Shanghai Jiao Tong University (Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays)