🤖 AI Summary
Existing zero-shot embodied navigation methods compress visual observations into textual relations, leading to irreversible loss of visual evidence, vocabulary constraints, and high construction overhead. This paper proposes the Multi-modal 3D Scene Graph (M3DSG), the first approach to explicitly model joint visual-semantic representations in a zero-shot setting. M3DSG employs a vision encoder to extract local geometric and appearance features, then integrates cross-modal alignment and graph neural networks for end-to-end scene graph construction and reasoning. By bypassing text-based semantic compression, M3DSG enables open-vocabulary understanding and zero-shot transfer. On benchmarks including R2R and REVERIE, M3DSG significantly outperforms text-based scene graph methods, achieving a 12.7% absolute improvement in zero-shot navigation success rate. These results demonstrate the critical role of multimodal structured representations in enabling generalization across complex, unseen environments.
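To make the described representation concrete, below is a minimal Python sketch, under assumptions, of a multimodal 3D scene graph: each node stores a vision-encoder embedding alongside its 3D centroid, edges keep raw geometric relations instead of text strings, and open-vocabulary grounding ranks nodes by similarity to a text-query embedding. All names here (`ObjectNode`, `MultimodalSceneGraph`, `connect_nearby`, `query`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multimodal 3D scene graph; not the authors' code.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    node_id: int
    label: str                   # open-vocabulary label (may be empty)
    centroid: np.ndarray         # (3,) position in the world frame
    visual_feature: np.ndarray   # (D,) embedding from a vision encoder

@dataclass
class MultimodalSceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, dst_id, displacement)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def connect_nearby(self, max_dist: float = 2.0) -> None:
        """Link objects by proximity, keeping the raw 3D displacement vector
        as the edge attribute rather than a text relation like 'next to'."""
        ids = list(self.nodes)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                delta = self.nodes[b].centroid - self.nodes[a].centroid
                if np.linalg.norm(delta) <= max_dist:
                    self.edges.append((a, b, delta))

    def query(self, text_feature: np.ndarray, k: int = 3) -> list:
        """Zero-shot grounding: rank nodes by cosine similarity between a
        text-query embedding and each node's stored visual feature."""
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
        scored = [(cos(text_feature, n.visual_feature), nid)
                  for nid, n in self.nodes.items()]
        return [nid for _, nid in sorted(scored, reverse=True)[:k]]
```

Because the visual embeddings are retained on the nodes, the graph can be re-queried with new language goals without rebuilding it, which is the property the summary attributes to avoiding text-based semantic compression.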
📝 Abstract
Embodied navigation is a fundamental capability for robotic agents operating in real-world environments. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation