Open-set 3D semantic instance maps for vision language navigation – O3D-SIM

📅 2024-04-27
🏛️ Adv. Robotics
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses language-query-based embodied vision-language navigation under open-set conditions. Methodologically, it introduces a 3D Open-Set Semantic Instance Mapping (O3D-SIM) framework that integrates multimodal foundation models—specifically CLIP for cross-modal alignment and zero-shot object recognition, and SAM for image-level segmentation—combined with SLAM-based pose estimation and point-cloud instance clustering to construct an open-set, instance-level, semantically enriched 3D map. The core contribution is the first realization of open-set 3D instance semantic mapping capable of supporting language queries involving previously unseen object categories, thereby overcoming the limitations of conventional closed-set semantic mapping. Experimental results demonstrate substantial improvements in language-guided navigation success rates; qualitative analysis further confirms strong generalization capability and interpretability of the generated maps.
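As a rough illustration of how a 2D segmentation mask and an estimated camera pose could be combined into 3D instance points, here is a minimal sketch under the standard pinhole camera model. The function name `backproject_mask`, the intrinsics matrix `K`, and the camera-to-world transform `T_world_cam` are assumptions made for illustration; the summary does not specify the paper's actual projection code.

```python
import numpy as np

def backproject_mask(depth, mask, K, T_world_cam):
    """Lift masked depth pixels to world-frame 3D points (pinhole model).

    depth: (H, W) float array of metric depths; 0 marks invalid pixels.
    mask: (H, W) boolean array for one segmented instance.
    K: (3, 3) camera intrinsics (hypothetical input).
    T_world_cam: (4, 4) camera-to-world pose, e.g. from a SLAM system.
    """
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous
    return (T_world_cam @ pts_cam.T).T[:, :3]
```

Points obtained this way from several frames could then be merged per instance, which is where a clustering step would come in.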

📝 Abstract
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps (Nanwani L, Agarwal A, Jain K, et al. Instance-level semantic maps for vision language navigation. In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE; 2023 Aug.) showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify.
Project Page - https://smart-wheelchair-rrc.github.io/o3d-sim-webpage
Problem

Research questions and friction points this paper is trying to address.

Creating 3D semantic instance maps for language-guided navigation tasks
Improving object recognition robustness using open-set foundational models
Enhancing success rates of vision-language navigation through instance embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends instance-level semantic mapping to 3D
Leverages foundational models for recognition and segmentation
Generates queryable 3D point clouds with instance embeddings
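To make the idea of a queryable map concrete, the sketch below ranks instances by cosine similarity between a text embedding and per-instance embeddings, then reads off the best match's centroid as a navigation goal. Everything here is illustrative: `query_instances`, the toy 3-dimensional vectors, and the `centroids` array are hypothetical stand-ins, and the real system would use embeddings from a language- and image-aligned model rather than hand-written numbers.

```python
import numpy as np

def query_instances(text_emb, instance_embs, top_k=1):
    """Rank map instances by cosine similarity to a text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    m = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    scores = m @ t                        # one cosine score per instance
    order = np.argsort(-scores)[:top_k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings standing in for real image/text features (assumed values).
instance_embs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.7, 0.7, 0.0]])
# Hypothetical 3D centroid of each instance's points in the map frame.
centroids = np.array([[0.5, 2.0, 0.0],
                      [3.0, 1.0, 0.0],
                      [4.0, 4.0, 0.5]])
text_emb = np.array([0.1, 0.9, 0.0])

best_id, best_score = query_instances(text_emb, instance_embs)[0]
goal = centroids[best_id]  # position a planner could be asked to reach
```

Because matching happens in embedding space rather than against a fixed label list, a query for a category never seen at mapping time can still score against stored instances, which is the open-set property the abstract describes.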
Laksh Nanwani
International Institute of Information Technology, Hyderabad, India
Kumaraditya Gupta
PhD Student, Mila, Université de Montréal
Robotics, Computer Vision, 3D Vision
Aditya Mathur
Purdue University and Singapore University of Technology and Design
Critical Infrastructure Security, Software Testing, Software Reliability, Security of Cyber-Physical Systems
Swayam Agrawal
International Institute of Information Technology, Hyderabad, India
A. Hafez
Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey
K. M. Krishna
International Institute of Information Technology, Hyderabad, India