🤖 AI Summary
This paper addresses natural-language-driven localization of 3D point cloud submaps. Methodologically, it proposes a coarse-to-fine cross-modal matching framework, introducing masked instance training and modality-aware hierarchical contrastive learning to strengthen the robustness of language–point cloud alignment. It further designs a lightweight yet precise fine-localization architecture that requires no explicit text–instance alignment, integrating a pretrained language model, hierarchical Transformers, an attention-based point cloud encoder, and prototype-based map cloning with a cascaded cross-attention mechanism. On the KITTI360Pose benchmark, the method improves over the state of the art by up to 15%. Extensive validation on a newly constructed large-scale urban-scene dataset demonstrates strong generalization to complex, ambiguous natural language descriptions and diverse urban structures, significantly advancing semantics-driven 3D spatial retrieval in open-world scenarios.
📝 Abstract
We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both colored and uncolored point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-level, text-level, and instance-level losses. In the fine localization stage, we completely remove explicit text–instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
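The contrastive objective behind MHCL is only named here, not specified. As a hedged illustration of the cross-modal component, the sketch below implements a symmetric InfoNCE loss between L2-normalized text and submap embeddings — a standard formulation for cross-modal retrieval, not necessarily the paper's exact loss; the function name, temperature value, and NumPy implementation are all assumptions for illustration.

```python
import numpy as np

def info_nce(text_emb, cloud_emb, temperature=0.07):
    """Symmetric InfoNCE between text and point-cloud submap embeddings.

    Rows of text_emb and cloud_emb are assumed to be matched pairs, so the
    positives lie on the diagonal of the similarity matrix. This is a generic
    cross-modal contrastive loss, used here only to illustrate the idea.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = cloud_emb / np.linalg.norm(cloud_emb, axis=1, keepdims=True)
    logits = t @ c.T / temperature           # (N, N) similarity matrix
    labels = np.arange(logits.shape[0])      # matched pairs on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the text->submap and submap->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

A full MHCL-style objective would, per the abstract, combine such a cross-modal term with additional submap-, text-, and instance-level losses, whose exact forms are not given here.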