VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of accurately localizing natural language descriptions within 3D point cloud maps by proposing VLM-Loc, a novel approach that leverages vision-language models (VLMs) for cross-modal alignment between text and point clouds. By transforming point clouds into bird's-eye-view images and structured scene graphs, the method jointly encodes geometric and semantic information. A partial node assignment mechanism is introduced to enable interpretable spatial reasoning. Evaluated on the newly constructed CityLoc benchmark, VLM-Loc significantly outperforms existing methods, achieving state-of-the-art localization accuracy and robustness while enhancing the model's spatial reasoning capability and decision interpretability.

πŸ“ Abstract
Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate that VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
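The abstract's first preprocessing step, projecting a point cloud into a BEV image, can be sketched generically. The paper does not publish its rendering details, so the snippet below is only a minimal illustration of the standard technique: rasterize (x, y) coordinates into a grid and keep the maximum height per cell. The function name, ranges, and resolution are all illustrative assumptions.

```python
import numpy as np

def points_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), res=0.5):
    """Rasterize an (N, 3) point cloud into a bird's-eye-view height map.

    Each cell stores the maximum z (height) of the points that fall into
    it; empty cells stay at 0. A generic BEV projection sketch, not the
    paper's exact rendering.
    """
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)

    # Keep only points inside the map bounds.
    mask = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
        (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    )
    pts = points[mask]

    # Convert metric coordinates to integer pixel indices.
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)

    # Max-height aggregation per cell (handles repeated indices).
    np.maximum.at(bev, (ix, iy), pts[:, 2])
    return bev
```

The resulting 2D grid can then be fed to a VLM as an ordinary image, which is what makes the cross-modal alignment in the abstract tractable.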
Problem

Research questions and friction points this paper is trying to address.

text-to-point-cloud localization
spatial reasoning
vision-language models
3D point cloud maps
natural language descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Point Cloud Localization
Scene Graph
Bird's-Eye-View Representation
Cross-Modal Reasoning
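The partial node assignment idea listed above, associating textual cues with scene graph nodes while allowing some cues to remain unmatched, can be sketched with a toy similarity-based matcher. The paper's actual mechanism operates inside the VLM; the embeddings, threshold, and function below are illustrative assumptions only.

```python
import numpy as np

def partial_assign(cue_embs, node_embs, threshold=0.5):
    """Match each textual cue to its most similar scene graph node.

    cue_embs: (C, D) embeddings of cue phrases; node_embs: (N, D) node
    embeddings. Returns one node index per cue, with -1 for cues whose
    best cosine similarity falls below `threshold` (left unassigned,
    hence "partial"). A toy stand-in, not the paper's mechanism.
    """
    # Normalize rows so dot products become cosine similarities.
    c = cue_embs / np.linalg.norm(cue_embs, axis=1, keepdims=True)
    n = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    sim = c @ n.T                      # (C, N) similarity matrix
    best = sim.argmax(axis=1)          # best node per cue
    scores = sim.max(axis=1)
    return [int(b) if s >= threshold else -1
            for b, s in zip(best, scores)]
```

Because each cue maps to a concrete node (or is explicitly rejected), the assignment itself can be inspected, which is the interpretability benefit the abstract claims.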