NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN

πŸ“… 2024-11-12
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
This paper targets zero-shot, object-centric navigation guided by natural language instructions (e.g., β€œgo to the wooden desk”). It proposes NL-SLAM, a natural-language-grounded SLAM method that grounds instructions to robot observations and poses, and performs this grounding actively so the robot can follow object-centric navigation instructions online. The method leverages pre-trained vision and language foundation models and requires no task-specific training or instruction-specific fine-tuning. Key contributions: (1) OC-VLN, a new benchmark for evaluating the grounding of object-centric natural language navigation instructions; (2) performance gains over strong Object Goal Navigation and Vision-Language Navigation baselines across all success metrics on OC-VLN; and (3) a real-world deployment on a Boston Dynamics Spot robot, demonstrating instruction following in indoor environments.
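The core idea of grounding a language instruction onto a spatial-semantic map can be illustrated with a toy example: each mapped landmark carries an embedding from a vision-language model, and the instruction is resolved to the landmark whose embedding is most similar. This is a minimal sketch of that matching step, not the paper's implementation; the function names and the toy three-dimensional embeddings are hypothetical stand-ins for real model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ground_instruction(instruction_emb, landmark_embs):
    """Return the mapped landmark whose embedding best matches the instruction.

    landmark_embs: dict mapping landmark name -> embedding vector
    (in a real system, produced by a pre-trained vision-language model).
    """
    return max(landmark_embs,
               key=lambda name: cosine(instruction_emb, landmark_embs[name]))

# Toy embeddings standing in for real vision-language model features.
landmarks = {
    "wooden desk": [0.9, 0.1, 0.2],
    "red sofa": [0.1, 0.8, 0.3],
    "kitchen sink": [0.2, 0.3, 0.9],
}
target = ground_instruction([0.85, 0.15, 0.25], landmarks)
# target == "wooden desk"
```

Because both landmarks and instructions live in one shared embedding space, no task-specific training is needed: the matching is pure nearest-neighbor retrieval, which is what makes the zero-shot setting possible.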

πŸ“ Abstract
Landmark-based navigation (e.g., β€œgo to the wooden desk”) and relative positional navigation (e.g., β€œmove 5 meters forward”) are distinct navigation challenges solved very differently in existing robotics navigation methodology. We present a new dataset, OC-VLN, in order to distinctly evaluate grounding object-centric natural language navigation instructions in a method for performing landmark-based navigation. We also propose Natural Language grounded SLAM (NL-SLAM), a method to ground natural language instructions to robot observations and poses. We actively perform NL-SLAM in order to follow object-centric natural language navigation instructions. Our methods leverage pre-trained vision and language foundation models and require no task-specific training. We construct two strong baselines from state-of-the-art methods on related tasks, Object Goal Navigation and Vision-Language Navigation, and we show that our approach, NL-SLAM, outperforms these baselines across all our metrics of success on OC-VLN. Finally, we successfully demonstrate the effectiveness of NL-SLAM for performing navigation instruction following in the real world on a Boston Dynamics Spot robot.
Problem

Research questions and friction points this paper is trying to address.

Grounding natural language instructions in 3D maps
Enabling zero-shot navigation in novel environments
Improving object-centric instruction following in real-world robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 3D graph mapping with robot poses
Uses Language-Inferred Factor Graph for instructions
Enables real-world navigation on Boston Dynamics Spot
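The bullets above describe tying instruction grounding to robot poses recorded during mapping. A minimal sketch of that idea, assuming a simple landmark-to-pose lookup (the data structure and function names here are hypothetical, not the paper's factor-graph formulation): an ordered list of landmark phrases parsed from the instruction is resolved into a route of map poses.

```python
def plan_route(sub_instructions, landmark_poses):
    """Resolve landmark-style sub-instructions to stored map poses, in order.

    sub_instructions: ordered landmark phrases parsed from an instruction.
    landmark_poses: dict mapping landmark name -> (x, y) pose recorded
    while mapping. Unknown landmarks are skipped here; a real system
    would instead trigger active exploration to find them.
    """
    return [landmark_poses[name]
            for name in sub_instructions
            if name in landmark_poses]

# Poses recorded during mapping (toy values).
poses = {"doorway": (0.0, 1.5), "wooden desk": (3.2, 4.0)}
route = plan_route(["doorway", "wooden desk", "window"], poses)
# route == [(0.0, 1.5), (3.2, 4.0)]
```

The point of the sketch is the separation of concerns: SLAM supplies the landmark-to-pose association, and instruction following reduces to ordered retrieval over that association.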