GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

📅 2025-11-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work systematically investigates the capabilities of large multimodal models (LMMs) on cross-view geo-localization and pose estimation, a spatial perception task that matters for navigation yet remains underexplored. To this end, we introduce GeoX-Bench, a dedicated benchmark comprising 10,859 panoramic-satellite image pairs spanning 128 cities across 49 countries, along with 755,976 question-answer pairs, of which 42,900 are reserved for benchmarking and the remainder for training. Using this benchmark, we assess 25 state-of-the-art LMMs. Results show strong performance on geo-localization but a significant drop on the more complex pose estimation tasks. Instruction-tuning on the GeoX-Bench training data significantly improves cross-view geo-sense abilities. The study provides both a benchmark for diagnosing LMM capabilities on these tasks and training data for targeted improvement.

📝 Abstract

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore how much instruction-tuning improves them. Our benchmark results demonstrate that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench significantly improves their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.
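
The dataset figures in the abstract imply a split that is easy to check by hand. The short Python sketch below works through that arithmetic using only the numbers quoted above. The function name `split_summary` and its dictionary keys are illustrative and not part of GeoX-Bench. The per-image-pair figure is a plain average, since the abstract does not say how QA pairs are distributed across image pairs.

```python
# Figures quoted in the abstract.
TOTAL_QA_PAIRS = 755_976
BENCHMARK_QA_PAIRS = 42_900
IMAGE_PAIRS = 10_859


def split_summary(total=TOTAL_QA_PAIRS,
                  benchmark=BENCHMARK_QA_PAIRS,
                  image_pairs=IMAGE_PAIRS):
    """Derive split sizes from the abstract's counts (illustrative helper)."""
    # Everything not reserved for benchmarking is treated as training data,
    # following the abstract's description of "the remainder".
    training = total - benchmark
    return {
        "training_qa_pairs": training,
        "benchmark_share": benchmark / total,
        # Mean only; the real distribution per image pair is not stated.
        "qa_per_image_pair": total / image_pairs,
    }


summary = split_summary()
print(f"training QA pairs: {summary['training_qa_pairs']:,}")
print(f"benchmark share:   {summary['benchmark_share']:.1%}")
print(f"QA per image pair: {summary['qa_per_image_pair']:.1f}")
```

Under these counts, 713,076 QA pairs remain for training, the benchmark portion is roughly 5.7% of all QA pairs, and each image pair carries about 69.6 QA pairs on average.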

Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' cross-view geo-localization capabilities using panoramic-satellite image pairs

Assessing pose estimation abilities of large multimodal models across diverse geographic locations

Benchmarking 25 state-of-the-art LMMs on complex geospatial reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoX-Bench benchmark for cross-view geo-localization

Evaluates 25 LMMs using panoramic-satellite image pairs

Instruction-tuning enhances cross-view geo-sense abilities
