Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two obstacles to lightweight, privacy-preserving monocular re-localization in OpenStreetMap (OSM): the large cross-modal discrepancy between natural images and map data, and the high computational cost of global matching. To this end, we propose a semantic-aware, hierarchical coarse-to-fine search framework. Our approach is the first to leverage the semantic perception capability of DINO-ViT for OSM-based re-localization: it aligns semantic features between query images and map data, and it replaces exhaustive global matching with an efficient hierarchical retrieval strategy. Trained on a single dataset, the method reaches an orientation recall at a 3° threshold that exceeds the recall current state-of-the-art approaches reach at the looser 5° threshold. It improves both localization accuracy and computational efficiency while keeping the map representation lightweight.

📝 Abstract
Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), a lightweight and privacy-preserving map, offers semantic and geometric information with global scalability. Nonetheless, two challenges remain in using OSM for localization: the inherent cross-modal discrepancy between natural images and OSM, and the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness of DINO-ViT is used to decompose the image into visual elements and to establish semantic relationships between them and OSM. Second, a coarse-to-fine search paradigm replaces global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, our method's orientation recall at 3° even exceeds the recall of state-of-the-art methods at 5°.
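To make the coarse-to-fine idea concrete, here is a minimal sketch of a hierarchical pose search over (x, y, yaw). It is not the paper's implementation: the function names, grid steps, top-k pruning rule, and the toy scoring function are all assumptions for illustration. In the actual method the score would come from comparing image features against OSM features, which is not modeled here.

```python
import itertools


def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def coarse_to_fine_search(score_fn, extent_m, xy_step_m=16.0,
                          yaw_step_deg=45.0, levels=3, top_k=3):
    """Search (x, y, yaw) poses level by level instead of densely.

    score_fn is a hypothetical callable (x, y, yaw_deg) -> float, higher is
    better. Returns the best pose found and the number of poses scored.
    """
    # Level 0: a coarse grid over the whole square map tile.
    n_xy = int(extent_m // xy_step_m) + 1
    n_yaw = int(360.0 // yaw_step_deg)
    grid = [i * xy_step_m for i in range(n_xy)]
    yaws = [k * yaw_step_deg for k in range(n_yaw)]
    candidates = list(itertools.product(grid, grid, yaws))

    evaluated = 0
    best = []
    for level in range(levels):
        ranked = sorted(candidates, key=lambda p: score_fn(*p), reverse=True)
        evaluated += len(candidates)
        best = ranked[:top_k]  # keep only the most promising poses
        if level == levels - 1:
            break
        # Halve both steps and re-sample only around the surviving poses.
        xy_step_m /= 2.0
        yaw_step_deg /= 2.0
        refined = set()
        for x, y, yaw in best:
            for dx, dy, dk in itertools.product((-1, 0, 1), repeat=3):
                nx, ny = x + dx * xy_step_m, y + dy * xy_step_m
                if 0.0 <= nx <= extent_m and 0.0 <= ny <= extent_m:
                    refined.add((nx, ny, (yaw + dk * yaw_step_deg) % 360.0))
        candidates = sorted(refined)
    return best[0], evaluated


def toy_score(x, y, yaw):
    """Stand-in for an image-to-map similarity; peaks at (40 m, 24 m, 90 deg)."""
    return -((x - 40.0) ** 2 + (y - 24.0) ** 2) - angle_diff_deg(yaw, 90.0) ** 2
```

Calling `coarse_to_fine_search(toy_score, extent_m=64.0)` recovers the peak pose while scoring at most 362 poses. A dense search at the same final resolution (4 m, 11.25°) would score 17 × 17 × 32 = 9248 poses. The pruning is only safe when the coarse score is a reliable indicator of where the fine optimum lies, which is an assumption of this sketch.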
Problem

Research questions and friction points this paper is trying to address.

monocular re-localization
OpenStreetMap
cross-modal discrepancy
computational cost
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

coarse-to-fine localization
semantic alignment
OpenStreetMap
monocular re-localization
DINO-ViT
Yuchen Zou
School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Xiao Hu
International Digital Economy Academy, Guangdong, Shenzhen 510085, China
Dexing Zhong
Associate Professor of Automation, Xi'an Jiaotong University
Machine learning, Computer Vision, Image processing
Yuqing Tang
Facebook AI
Deep Learning, Machine Translation, Artificial Intelligence, Multiagent Systems