MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot embodied navigation methods compress visual observations into textual relations, leading to irreversible loss of visual evidence, vocabulary constraints, and high construction overhead. This paper proposes the Multimodal 3D Scene Graph (M3DSG), the first approach to explicitly model joint visual-semantic representations in zero-shot settings. M3DSG employs a vision encoder to extract local geometric and appearance features, then integrates cross-modal alignment and graph neural networks for end-to-end scene graph construction and reasoning. By bypassing text-based semantic compression, M3DSG enables open-vocabulary understanding and zero-shot transfer. On benchmarks including R2R and REVERIE, M3DSG significantly outperforms text-based scene graph methods, achieving a 12.7% absolute improvement in zero-shot navigation success rate. These results demonstrate the critical role of multimodal structured representations in enabling generalization across complex, unseen environments.
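The summary describes a graph whose nodes carry visual embeddings and whose edges keep visual evidence rather than text relations. The paper's actual data structures are not shown here; the following is a minimal, hypothetical sketch of that idea (all names — `SceneNode`, `MultimodalSceneGraph`, `visual_key` — are illustrative assumptions, not the authors' API), where goal matching is done by embedding similarity instead of vocabulary lookup.

```python
from dataclasses import dataclass
import math

@dataclass
class SceneNode:
    """An object node: 3D position plus a visual embedding (hypothetical)."""
    node_id: int
    position: tuple   # (x, y, z) centroid in the 3D map
    embedding: list   # visual feature vector, e.g. from a vision encoder
    label: str = ""   # optional text label; not required for matching

@dataclass
class SceneEdge:
    """An edge that stores a visual key instead of a text relation."""
    src: int
    dst: int
    distance: float   # geometric relation, recomputed from node positions
    visual_key: list  # feature of the image region covering both objects

class MultimodalSceneGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node: SceneNode) -> None:
        self.nodes[node.node_id] = node

    def connect(self, src: int, dst: int, visual_key: list) -> None:
        # Geometric relation is derived, so it never goes stale as text would.
        dist = math.dist(self.nodes[src].position, self.nodes[dst].position)
        self.edges.append(SceneEdge(src, dst, dist, visual_key))

    def query(self, goal_embedding: list) -> SceneNode:
        """Return the node most similar to the goal, open-vocabulary style:
        cosine similarity over embeddings, no fixed label set."""
        def cos(u, v):
            num = sum(x * y for x, y in zip(u, v))
            den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
            return num / den if den else 0.0
        return max(self.nodes.values(), key=lambda n: cos(n.embedding, goal_embedding))
```

Because matching happens in embedding space, a goal phrase only needs to be encoded with the same vision-language encoder; no node label has to come from a predefined vocabulary, which is the property the summary attributes to bypassing text-based semantic compression.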

📝 Abstract
Embodied navigation is a fundamental capability for robotic agents operating in real-world environments. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relations with visual representations.
Problem

Research questions and friction points this paper is trying to address.

Addresses irreversible visual loss in text-only scene graphs
Reduces high construction costs of explicit 3D representations
Enables open-vocabulary generalization for embodied navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal 3D Scene Graph preserves visual cues
Replaces textual relations with visual representations
Enables zero-shot embodied navigation with open vocabulary
👥 Authors
Xun Huang (unknown affiliation) — Generative Models
Shijia Zhao (Xiamen University)
Yunxiang Wang (Nanyang Technological University)
Xin Lu (University of Chinese Academy of Sciences)
Wanfa Zhang (Xiamen University)
Rongsheng Qu (Beihang University)
Weixin Li (Beihang University)
Yunhong Wang (Professor, School of Computer Science and Engineering, Beihang University) — Biometrics, Pattern Recognition, Image Processing, Computer Vision
Chenglu Wen (Professor, Xiamen University) — 3D vision, point clouds, mobile mapping, robotics