🤖 AI Summary
This work addresses the challenge of estimating the 3-DoF camera pose (position and heading) in complex environments, a task conventionally tackled by matching ground-level pinhole images with satellite imagery. The authors propose RHO, a novel model that, for the first time, pairs holistic panoramic images with OpenStreetMap to enable robust Metric Cross-View Geo-Localization (MCVGL). RHO employs a dual-branch Pin-Pan architecture augmented with a Split-Undistort-Merge (SUM) module to correct panoramic distortions, and introduces a Position-Orientation Fusion (POF) mechanism to jointly refine position and orientation estimates. Evaluated on CV-RHO, a newly curated large-scale benchmark of over 2.7M images, the end-to-end model significantly outperforms existing methods, achieving up to a 20% improvement in localization accuracy and thereby demonstrating both the value of the dataset and the efficacy of the proposed approach.
📝 Abstract
Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images covering varied weather and lighting conditions as well as sensor noise. Furthermore, we propose a model, termed RHO, with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance localization accuracy. Extensive experiments demonstrate the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain of up to 20% over state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.
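The "Split-Undistort-Merge" step can be illustrated with a minimal sketch: an equirectangular panorama is split into several viewing directions, each is undistorted by reprojecting it onto a virtual pinhole camera, and the resulting views are merged. This is an assumption-laden toy version, not the paper's implementation; the function names, the nearest-neighbour sampling, and the naive stacking merge are ours.

```python
import numpy as np

def equirect_to_perspective(pano, yaw_deg, fov_deg=90.0, out_size=64):
    """Sample one undistorted pinhole view from an equirectangular panorama.

    pano: (H, W, C) array; yaw_deg: viewing azimuth in degrees.
    Uses nearest-neighbour sampling for simplicity.
    """
    H, W = pano.shape[:2]
    # Pinhole focal length implied by the requested field of view.
    f = (out_size / 2) / np.tan(np.radians(fov_deg) / 2)
    # Pixel grid of the virtual camera, centred on the optical axis.
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    x = (u - out_size / 2) / f
    y = (v - out_size / 2) / f
    z = np.ones_like(x)
    # Rotate the viewing rays about the vertical axis by the requested yaw.
    yaw = np.radians(yaw_deg)
    xr = x * np.cos(yaw) + z * np.sin(yaw)
    zr = -x * np.sin(yaw) + z * np.cos(yaw)
    # Ray direction -> spherical longitude/latitude -> panorama pixel coords.
    lon = np.arctan2(xr, zr)                  # in [-pi, pi]
    lat = np.arctan2(y, np.hypot(xr, zr))     # in [-pi/2, pi/2]
    px = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    py = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)
    return pano[py, px]

def split_undistort_merge(pano, num_views=4, fov_deg=90.0, out_size=64):
    """Toy SUM pipeline: split the panorama into evenly spaced undistorted
    views and merge them by stacking (a learned model would instead fuse
    per-view features)."""
    yaws = np.linspace(0, 360, num_views, endpoint=False)
    views = [equirect_to_perspective(pano, yaw, fov_deg, out_size)
             for yaw in yaws]
    return np.stack(views)  # (num_views, out_size, out_size, C)

# Hypothetical usage on a random 2:1 equirectangular panorama.
pano = np.random.rand(256, 512, 3)
merged = split_undistort_merge(pano)
print(merged.shape)  # (4, 64, 64, 3)
```

The key point is that each extracted view obeys pinhole geometry, so the same matching backbone used for ordinary ground images (the "Pin" branch) can, in principle, be reused on the panorama branch.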