🤖 AI Summary
This work addresses the challenging cross-modal, cross-view visual localization problem of matching a single front-facing image against OpenStreetMap (OSM) vector maps. We propose a brain-inspired, geometric-semantic co-guided localization paradigm. Methodologically: (1) we design a geometry-guided depth distribution adapter that enables differentiable alignment from monocular depth to bird's-eye-view (BEV) map space, the first such formulation; (2) we construct OSM semantic embeddings to strengthen image-to-map feature matching; and (3) we integrate vision foundation model features, monocular depth estimation, and BEV-transformed representations in a single pipeline. Evaluated on the MGL and KITTI datasets and our newly established global cross-area and cross-condition (CC) benchmark, our method achieves a 21.3% improvement in localization accuracy and a 37.5% gain in cross-weather and cross-illumination generalization, and it enables zero-shot city transfer, significantly outperforming state-of-the-art approaches.
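The depth distribution adapter in (1) is only named above, so the minimal PyTorch sketch below illustrates one plausible form: predict a per-pixel depth distribution, optionally bias it with a monocular depth prior, and use it to lift image features into a camera frustum for BEV pooling, in the spirit of Lift-Splat-style view transforms. All module names, channel sizes, and the assumption of a pre-binned prior are illustrative and not the released OSMLoc implementation.

```python
import torch
import torch.nn as nn


class DepthDistributionAdapter(nn.Module):
    """Sketch of a depth-distribution view transform: predict a per-pixel depth
    distribution, optionally biased by a monocular depth prior, and use it to
    lift image features into a frustum volume that can be splatted onto a BEV grid."""

    def __init__(self, feat_dim: int = 128, num_depth_bins: int = 64):
        super().__init__()
        # Hypothetical 1x1 heads; the real adapter is likely more elaborate.
        self.depth_head = nn.Conv2d(feat_dim, num_depth_bins, kernel_size=1)
        self.feat_head = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, img_feat, depth_prior_logits=None):
        # img_feat: (B, C, H, W) features from the vision foundation model.
        depth_logits = self.depth_head(img_feat)               # (B, D, H, W)
        if depth_prior_logits is not None:
            # Geometric guidance: bias the learned distribution toward the
            # monocular depth prior (assumed pre-binned to the same D bins).
            depth_logits = depth_logits + depth_prior_logits
        depth_prob = depth_logits.softmax(dim=1)               # per-pixel depth distribution
        feat = self.feat_head(img_feat)                        # (B, C, H, W)
        # Outer product spreads each pixel's feature over its depth bins.
        frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(2)  # (B, C, D, H, W)
        return frustum  # splat onto the BEV grid using camera intrinsics/extrinsics


if __name__ == "__main__":
    adapter = DepthDistributionAdapter(feat_dim=128, num_depth_bins=64)
    out = adapter(torch.randn(1, 128, 32, 88))
    print(out.shape)  # torch.Size([1, 128, 64, 32, 88])
```

Because the depth distribution enters only through a softmax and an outer product, gradients flow from the BEV features back into both the depth head and the image backbone, which is what makes the monocular-depth-to-BEV alignment differentiable end to end.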
📝 Abstract
OpenStreetMap (OSM), a rich and versatile source of volunteered geographic information (VGI), facilitates human self-localization and scene understanding by integrating nearby visual observations with vectorized map data. However, the disparity in modality and perspective makes it difficult to match camera imagery against compact map representations, limiting the full potential of VGI data in real-world localization applications. Inspired by the fact that the human brain fuses geometric and semantic understanding for spatial localization, we propose OSMLoc, a brain-inspired visual localization approach that matches first-person-view images against OSM maps. It integrates semantic and geometric guidance to significantly improve accuracy, robustness, and generalization. First, we equip OSMLoc with a visual foundation model to extract powerful image features. Second, a geometry-guided depth distribution adapter is proposed to bridge monocular depth estimation and the camera-to-BEV transform. Third, semantic embeddings from the OSM data serve as auxiliary guidance for image-to-OSM feature matching. To validate OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation. Experiments on the MGL dataset, the CC validation benchmark, and the KITTI dataset demonstrate the superiority of our method. Code, pre-trained models, the CC validation benchmark, and additional results are available at: https://github.com/WHU-USI3DV/OSMLoc.
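To make the map branch and the image-to-OSM matching step concrete, the sketch below embeds rasterized OSM semantic classes into dense map features and scores a BEV feature template against them by sliding cross-correlation. The class count, channel sizes, and helper names (`OSMSemanticEncoder`, `match_bev_to_map`) are hypothetical and are not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OSMSemanticEncoder(nn.Module):
    """Sketch of the map branch: embed rasterized OSM semantic classes
    (roads, buildings, vegetation, ...) and encode them into dense features
    for image-to-OSM matching. Class count and channel sizes are assumptions."""

    def __init__(self, num_classes: int = 16, embed_dim: int = 32, feat_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(embed_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, class_map):
        # class_map: (B, H, W) integer class ids from the rasterized OSM tile.
        x = self.embed(class_map).permute(0, 3, 1, 2)  # (B, E, H, W)
        return self.encoder(x)                         # (B, C, H, W)


def match_bev_to_map(bev_feat, map_feat):
    """Slide a BEV feature template (C, h, w) over map features (C, H, W) and
    score each offset by cross-correlation; a full localizer would also sweep
    candidate headings and normalize the scores into a pose probability."""
    scores = F.conv2d(map_feat.unsqueeze(0), bev_feat.unsqueeze(0))  # (1, 1, H-h+1, W-w+1)
    return scores[0, 0]
```

In this reading, the OSM semantic embeddings and the BEV features from the depth-guided view transform live in a shared feature space, so localization reduces to finding the translation (and heading) where the two correlate most strongly.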