Multimodal Spatial Language Maps for Robot Navigation and Manipulation

📅 2025-06-07
🤖 AI Summary
Existing approaches to grounding language in a navigating agent's observations remain disconnected from environment mapping, lack the spatial precision of geometric maps, or ignore modalities beyond vision. This paper proposes multimodal spatial language maps, a unified representation that fuses pretrained multimodal foundation-model features with a 3D reconstruction of the environment, instantiated as visual-language maps (VLMaps) and their audio-visual-language extension (AVLMaps). The maps are built autonomously through standard exploration and enable zero-shot grounding of visual, linguistic, or auditory goal queries to spatial coordinates. Combined with large language models, VLMaps translate natural language commands into open-vocabulary spatial goals directly localized in the map, and can be shared across heterogeneous robot embodiments, including mobile bases and manipulators, to generate tailored obstacle maps on demand. Experiments in simulation and real-world settings demonstrate zero-shot spatial and multimodal goal navigation, with a 50% improvement in recall under ambiguous conditions.

📝 Abstract
Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
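The core mechanism described in the abstract — storing language-aligned features at 3D locations and localizing an open-vocabulary query by feature similarity — can be sketched as a toy voxel map. This is an illustrative sketch, not the paper's implementation: the `SpatialLanguageMap` class and its methods are hypothetical names, and the feature vectors stand in for outputs of a pretrained vision-language encoder (the paper fuses dense per-pixel features; here each observation is already a (feature, 3D point) pair for brevity).

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

class SpatialLanguageMap:
    """Toy voxel grid whose cells store a running mean of language-aligned features."""

    def __init__(self, resolution=0.5):
        self.resolution = resolution  # voxel edge length in meters
        self.features = {}            # voxel index -> mean feature vector
        self.counts = {}              # voxel index -> number of fused observations

    def integrate(self, point_xyz, feature):
        # Quantize the 3D point to a voxel index and update the running mean.
        key = tuple(np.floor(np.asarray(point_xyz) / self.resolution).astype(int))
        n = self.counts.get(key, 0)
        prev = self.features.get(key, np.zeros_like(feature))
        self.features[key] = (prev * n + feature) / (n + 1)
        self.counts[key] = n + 1

    def localize(self, query_feature):
        """Return the center of the voxel best matching the query (cosine similarity)."""
        keys = list(self.features)
        feats = normalize(np.stack([self.features[k] for k in keys]))
        q = normalize(np.asarray(query_feature, dtype=float))
        best = keys[int(np.argmax(feats @ q))]
        return (np.array(best) + 0.5) * self.resolution
```

In use, a text goal ("sofa") would be encoded by the same pretrained model into a query vector and passed to `localize`, yielding a map coordinate for the planner — the same pattern whether the query comes from text, an image, or an audio snippet.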
Problem

Research questions and friction points this paper is trying to address.

Grounding language in a navigating agent's observations lacks the spatial precision of geometric maps.
Existing methods neglect non-visual modalities like audio.
Current approaches fail to integrate multimodal features with 3D maps.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses multimodal features with 3D reconstruction
Autonomously builds visual-audio-language maps
Enables zero-shot multimodal goal navigation
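The disambiguation benefit behind these bullets — two locations that look identical to vision alone can be told apart once an audio cue is added — can be sketched at the candidate level. All names here are hypothetical, and the unit vectors stand in for pretrained encoder outputs; the actual system fuses dense cross-modal heatmaps over the map rather than scoring a handful of candidates.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in feature vectors for pretrained visual-language / audio-language encoders.
DOOR_VIS  = normalize(np.array([1.0, 0.2, 0.0]))  # "door" visual feature
KNOCK_AUD = normalize(np.array([0.0, 0.0, 1.0]))  # "knocking" audio feature
SILENCE   = normalize(np.array([0.0, 1.0, 0.0]))  # ambient audio feature

# Two candidate goals that are visually indistinguishable ("door") to vision alone.
candidates = {
    "door_A": {"vis": DOOR_VIS, "aud": KNOCK_AUD, "xyz": (1.0, 0.0, 0.0)},
    "door_B": {"vis": DOOR_VIS, "aud": SILENCE,   "xyz": (6.0, 0.0, 0.0)},
}

def disambiguate(text_feat, audio_feat):
    """Pick the candidate maximizing the product of per-modality similarities."""
    def score(c):
        s_vis = max(float(c["vis"] @ text_feat), 0.0)   # text query vs visual feature
        s_aud = max(float(c["aud"] @ audio_feat), 0.0)  # audio query vs audio feature
        return s_vis * s_aud  # multiplicative fusion: both cues must agree
    best = max(candidates, key=lambda k: score(candidates[k]))
    return best, candidates[best]["xyz"]
```

With only the text query "door", both candidates score equally; adding the knocking-sound query collapses the ambiguity to `door_A`, illustrating how extra sensory inputs raise recall in ambiguous scenes.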