🤖 AI Summary
To address zero-shot object-centric navigation in multi-floor, large-scale environments guided by natural language instructions (e.g., "go to the wooden desk"), this paper proposes NL-SLAM, the first natural language-driven end-to-end SLAM framework. Our method jointly integrates a pre-trained vision-language model (CLIP), geometric SLAM, active perception, and cross-modal alignment to enable online, task-agnostic mapping of language instructions onto spatial-semantic maps. Key contributions include: (1) introducing OC-VLN, the first benchmark for object-centric vision-language navigation; (2) achieving significant performance gains over both Object Goal Navigation and Vision-Language Navigation baselines on OC-VLN; and (3) successfully deploying NL-SLAM on a Boston Dynamics Spot robot, demonstrating robust real-world operation across complex indoor environments. NL-SLAM requires no instruction-specific fine-tuning and operates entirely online, enabling scalable, language-grounded spatial understanding without prior task supervision.
📄 Abstract
Landmark-based navigation (e.g., "go to the wooden desk") and relative positional navigation (e.g., "move 5 meters forward") are distinct navigation challenges solved very differently in existing robot navigation methods. We present a new dataset, OC-VLN, to distinctly evaluate the grounding of object-centric natural language navigation instructions within a method for performing landmark-based navigation. We also propose Natural Language grounded SLAM (NL-SLAM), a method to ground natural language instructions to robot observations and poses. We actively perform NL-SLAM in order to follow object-centric natural language navigation instructions. Our methods leverage pre-trained vision and language foundation models and require no task-specific training. We construct two strong baselines from state-of-the-art methods on related tasks, Object Goal Navigation and Vision-Language Navigation, and we show that our approach, NL-SLAM, outperforms these baselines across all our metrics of success on OC-VLN. Finally, we demonstrate the effectiveness of NL-SLAM for navigation instruction following in the real world on a Boston Dynamics Spot robot.
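The cross-modal alignment step described above can be sketched in miniature: score each candidate landmark observation against the instruction by the cosine similarity of their embeddings, and ground the instruction to the best-matching one. This is a hypothetical illustration, not the paper's implementation; the random vectors below stand in for embeddings that a real system would obtain from a pre-trained vision-language model such as CLIP.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def ground_instruction(text_emb: np.ndarray, observation_embs: list) -> tuple:
    """Return (index of best-aligned observation, all similarity scores).

    In a real NL-SLAM-style pipeline, `text_emb` would encode the instruction
    and `observation_embs` would encode image regions seen during mapping;
    here they are stand-in vectors for illustration only.
    """
    scores = [cosine_similarity(text_emb, obs) for obs in observation_embs]
    return int(np.argmax(scores)), scores


# Toy data: three stand-in "observation" embeddings, one of which is
# deliberately constructed to align closely with the instruction embedding.
rng = np.random.default_rng(0)
dim = 512
text_emb = rng.normal(size=dim)
observations = [rng.normal(size=dim) for _ in range(3)]
observations[1] = text_emb + 0.1 * rng.normal(size=dim)  # the matching landmark

best, scores = ground_instruction(text_emb, observations)
print(best)  # index of the observation most similar to the instruction
```

In high-dimensional space, unrelated random embeddings have near-zero cosine similarity, so the deliberately aligned observation wins by a wide margin; real CLIP embeddings behave analogously, which is why a simple argmax over similarities suffices for grounding.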