🤖 AI Summary
This paper addresses text-based outdoor localization in GPS-denied urban environments with a lightweight, scalable cross-modal retrieval method that matches scene graphs against semantic maps. Its core contribution is the first integration of OpenStreetMap (OSM), a structured and semantically rich vector map, with scene graphs generated from natural-language descriptions, which removes the need for storage-intensive point cloud maps. The approach has three components: OSM semantic parsing, joint text–map embedding, and a lightweight spatial indexing scheme, which together enable on-device inference. On KITTI360Pose, the method achieves localization accuracy comparable to state-of-the-art point cloud–based baselines, reduces city-scale map storage by over 90%, and completes each inference within a few seconds. The implementation is publicly available.
📝 Abstract
We propose GOTLoc, a robust localization method that operates even in outdoor environments where GPS signals are unavailable. It does so by comparing scene graphs generated from text descriptions with scene graphs generated from maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps scale poorly, because it is impractical to pre-generate them for all outdoor spaces, and their large data size makes them difficult to store and use directly on actual robots. To address these issues, GOTLoc stores spatial information in compact data structures such as scene graphs, enabling individual robots to carry and use large amounts of map data. Additionally, by using publicly available map data such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the effort of creating custom map data. For performance evaluation, we used the KITTI360Pose dataset together with the corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage than point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at https://github.com/donghwijung/GOTLoc.
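The retrieval idea described above, matching a scene graph built from a text description against scene graphs built from map regions, can be illustrated with a toy sketch. This is not GOTLoc's implementation: the triple-based graph representation, the bag-of-features embedding, the cosine scoring, and every name below (`embed`, `cosine`, `retrieve`, `osm_cells`, the labels and relations) are assumptions chosen for illustration, standing in for whatever learned embeddings the actual method uses.

```python
from collections import Counter
from math import sqrt

# A scene graph is modeled here as a list of (subject, relation, object)
# triples. This representation is an illustrative assumption, not the
# data structure used by GOTLoc.
Triple = tuple[str, str, str]


def embed(graph: list[Triple]) -> Counter:
    """Toy embedding: counts of node-label mentions and of full triples."""
    features: Counter = Counter()
    for subj, rel, obj in graph:
        features[f"node:{subj}"] += 1
        features[f"node:{obj}"] += 1
        features[f"edge:{subj}|{rel}|{obj}"] += 1
    return features


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: list[Triple],
             map_cells: dict[str, list[Triple]],
             top_k: int = 1) -> list[tuple[str, float]]:
    """Rank map cells by similarity of their scene graph to the query graph."""
    q = embed(query)
    scored = [(cell_id, cosine(q, embed(g))) for cell_id, g in map_cells.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical scene graph parsed from a text description such as
# "There is a building to the left of the road, with parking north of it."
text_graph = [
    ("building", "left_of", "road"),
    ("parking", "north_of", "building"),
]

# Hypothetical scene graphs derived from two OpenStreetMap regions.
osm_cells = {
    "cell_a": [
        ("building", "left_of", "road"),
        ("parking", "north_of", "building"),
        ("tree", "near", "road"),
    ],
    "cell_b": [
        ("road", "near", "sidewalk"),
        ("fence", "left_of", "sidewalk"),
    ],
}

best = retrieve(text_graph, osm_cells)
ranking = retrieve(text_graph, osm_cells, top_k=2)
```

In this sketch each map cell is stored as a handful of short triples rather than a dense point cloud, which is the storage argument the abstract makes; a real system would replace the count-based embedding with a learned graph encoder.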