Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language

📅 2025-11-19
🤖 AI Summary
This paper addresses natural language–driven 3D point cloud submap localization. Methodologically, it proposes a coarse-to-fine cross-modal matching framework that introduces Masked Instance Training and Modality-aware Hierarchical Contrastive Learning to improve the robustness of language–point cloud alignment. It further designs a lightweight yet precise fine-localization architecture, requiring no explicit text-instance matching, that integrates a pretrained language model, hierarchical Transformers, an attention-based point cloud encoder, and Prototype-based Map Cloning with a Cascaded Cross-Attention Transformer. Evaluated on KITTI360Pose, the method improves over the state of the art by up to 15%. Extensive validation on a newly constructed large-scale urban-scene dataset demonstrates strong generalization to complex, ambiguous natural language descriptions and diverse urban structures, advancing semantics-driven 3D spatial retrieval in open-world scenarios.

📝 Abstract
We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
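The cross-modal contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. This is a deliberate simplification: it shows only a symmetric InfoNCE-style cross-modal term over paired text/submap embeddings, whereas the paper's MHCL additionally combines submap-, text-, and instance-level losses. The function names and the temperature value below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_infonce(text_emb, submap_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, submap) embeddings.

    Row i of each matrix is assumed to describe the same location, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    t = l2_normalize(text_emb)
    s = l2_normalize(submap_emb)
    logits = t @ s.T / temperature  # (B, B) cosine-similarity matrix

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    # Text -> submap retrieval loss (rows) and submap -> text loss (columns).
    loss_t2s = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_s2t = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (loss_t2s + loss_s2t)
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; shuffling the submap rows against their texts drives it up, which is the signal the coarse retrieval stage trains on.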
Problem

Research questions and friction points this paper is trying to address.

Localizing 3D point cloud submaps using diverse natural language descriptions
Establishing cross-modal alignment between language and point cloud data
Handling complex linguistic expressions across varied urban environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Transformer with Max pooling for sentence semantics
Masked Instance Training filters non-aligned objects
Prototype-based Map Cloning enables lightweight fine localization
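The "Hierarchical Transformer with Max pooling" idea above reduces, at its simplest, to a two-level pooling hierarchy over hint sentences. The sketch below keeps only that pooling structure (mean over tokens within each hint, max across hints) and omits the transformer layers entirely; all names and shapes are illustrative assumptions.

```python
import numpy as np

def sentence_descriptor(hint_embeddings):
    """Aggregate per-hint token embeddings into one global text descriptor.

    hint_embeddings: list of (n_tokens_i, d) arrays, one per hint sentence.
    Each hint is mean-pooled to a d-dim vector; the final max pool across
    hints keeps the strongest activation per dimension.
    """
    per_hint = np.stack([h.mean(axis=0) for h in hint_embeddings])  # (H, d)
    return per_hint.max(axis=0)                                     # (d,)
```

Max pooling across hints is order-invariant, which matches the fact that location hints form an unordered set rather than a single sequence.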
Yan Xia
School of Artificial Intelligence and Data Science, University of Science and Technology of China, 230026 Hefei, China
Letian Shi
Technical University of Munich, 80333 Munich, Germany
Yilin Di
Technical University of Munich, 80333 Munich, Germany
João F. Henriques
Visual Geometry Group, University of Oxford
Computer Vision, Machine Learning, Circulant Matrices, Fourier Analysis
Daniel Cremers
Technical University of Munich
Computer Vision, Machine Learning, Optimization, Robotics