🤖 AI Summary
Existing zero-shot embodied navigation methods compress visual observations into textual relations, leading to irreversible loss of visual evidence, vocabulary constraints, and high construction overhead. This paper proposes the Multi-modal 3D Scene Graph (M3DSG), the first approach to explicitly model joint visual-semantic representations in a zero-shot setting. M3DSG employs a vision encoder to extract local geometric and appearance features, then integrates cross-modal alignment and graph neural networks for end-to-end scene graph construction and reasoning. By bypassing text-based semantic compression, M3DSG enables open-vocabulary understanding and zero-shot transfer. On benchmarks including R2R and REVERIE, M3DSG significantly outperforms text-based scene graph methods, achieving a 12.7% absolute improvement in zero-shot navigation success rate. These results demonstrate the critical role of multimodal structured representations in enabling generalization across complex, unseen environments.
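To make the described representation concrete, below is a minimal Python sketch, under assumptions, of a multimodal 3D scene graph: each node stores a vision-encoder embedding alongside its 3D centroid, edges keep raw geometric relations instead of text strings, and open-vocabulary grounding ranks nodes by similarity to a text-query embedding. All names here (`ObjectNode`, `MultimodalSceneGraph`, `connect_nearby`, `query`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multimodal 3D scene graph; not the authors' code.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    node_id: int
    label: str                   # open-vocabulary label (may be empty)
    centroid: np.ndarray         # (3,) position in the world frame
    visual_feature: np.ndarray   # (D,) embedding from a vision encoder

@dataclass
class MultimodalSceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, dst_id, displacement)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def connect_nearby(self, max_dist: float = 2.0) -> None:
        """Link objects by proximity, keeping the raw 3D displacement vector
        as the edge attribute rather than a text relation like 'next to'."""
        ids = list(self.nodes)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                delta = self.nodes[b].centroid - self.nodes[a].centroid
                if np.linalg.norm(delta) <= max_dist:
                    self.edges.append((a, b, delta))

    def query(self, text_feature: np.ndarray, k: int = 3) -> list:
        """Zero-shot grounding: rank nodes by cosine similarity between a
        text-query embedding and each node's stored visual feature."""
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
        scored = [(cos(text_feature, n.visual_feature), nid)
                  for nid, n in self.nodes.items()]
        return [nid for _, nid in sorted(scored, reverse=True)[:k]]
```

Because the visual embeddings are retained on the nodes, the graph can be re-queried with new language goals without rebuilding it, which is the property the summary attributes to avoiding text-based semantic compression.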
📝 Abstract
Embodied navigation is a fundamental capability for robotic agents operating in real-world environments. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation