AI Summary
This work systematically investigates the capabilities of large multimodal models (LMMs) on cross-view geolocalization and pose estimation, a critical yet underexplored navigation-level spatial perception task. To this end, we introduce GeoX-Bench, the first dedicated benchmark comprising 10,859 panoramic-satellite image pairs spanning 128 cities across 49 countries, along with 755,976 structured question-answer samples. Leveraging precise geographic registration and a standardized evaluation protocol, we comprehensively assess 25 state-of-the-art LMMs. Results reveal robust performance in coarse geolocalization but substantial limitations in fine-grained pose estimation (e.g., heading and pitch angles). To address this gap, we propose a spatial-aware instruction-tuning strategy that significantly enhances cross-view geometric reasoning. This study fills a critical void in LMM evaluation for navigation-grade spatial understanding, providing both a reproducible benchmark and a methodological framework for capability diagnosis and targeted improvement.
Abstract
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended for enhancing the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore how much instruction-tuning improves these capabilities. Our benchmark results demonstrate that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.