Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning

📅 2026-05-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing semantic mapping approaches struggle to jointly represent explicit geometry and multi-scale semantics while lacking native compatibility with large language models. This work proposes Gaussian-Language Maps (GLMap), a unified framework that integrates explicit geometric, instance-level, and region-level semantic information to establish a bimodal interface between natural language and 3D Gaussian representations. Key innovations include the design of bimodal semantic units, a gradient-free analytical estimation method for Gaussian parameters, and an efficient incremental mapping mechanism based on 3D Gaussian splatting. The resulting map enables zero-shot embodied navigation and reasoning, significantly improving performance in object localization and contextual understanding on ObjectNav, InstNav, and SQA tasks, while offering plug-and-play compatibility with large language models.
📝 Abstract
Understanding the geometric and semantic structure of environments is essential for embodied navigation and reasoning. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target navigation and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner. The code is available at https://github.com/sx-zhang/GLMap.
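The abstract says the Gaussian Estimator derives Gaussian parameters analytically from dense point clouds, but it does not give the formulas. As a rough illustration of what a gradient-free fit can look like, the sketch below uses standard moment matching: the sample mean becomes the Gaussian center, and an eigendecomposition of the sample covariance yields per-axis scales and a rotation. This is an assumption for illustration only, not necessarily the estimator used in GLMap; the function name `estimate_gaussian` and the `eps` regularizer are hypothetical.

```python
import numpy as np


def estimate_gaussian(points, eps=1e-6):
    """Closed-form fit of one 3D Gaussian to an (N, 3) point array.

    Illustrative moment matching only (hypothetical helper, not the
    paper's estimator). Returns (mean, scales, rotation) such that
    covariance ~= rotation @ diag(scales**2) @ rotation.T.
    """
    points = np.asarray(points, dtype=np.float64)
    if points.ndim != 2 or points.shape[1] != 3 or points.shape[0] < 2:
        raise ValueError("expected an (N, 3) array with N >= 2")

    # Center of the Gaussian: sample mean of the points.
    mean = points.mean(axis=0)

    # Unbiased sample covariance, lightly regularized so that flat or
    # degenerate clusters still give a positive-definite matrix.
    centered = points - mean
    cov = centered.T @ centered / (points.shape[0] - 1)
    cov += eps * np.eye(3)

    # Symmetric eigendecomposition: eigenvectors are the principal axes,
    # eigenvalues are the variances along those axes (ascending order).
    eigvals, eigvecs = np.linalg.eigh(cov)
    scales = np.sqrt(np.clip(eigvals, eps, None))

    # Flip one axis if needed so the basis is a proper rotation (det = +1).
    if np.linalg.det(eigvecs) < 0:
        eigvecs[:, 0] *= -1.0

    return mean, scales, eigvecs
```

No iterative optimization is involved, so a fit of this kind costs one pass over the points plus a 3x3 eigendecomposition, which is the sort of property that makes incremental map updates cheap.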
Problem

Research questions and friction points this paper is trying to address.

semantic mapping
embodied navigation
zero-shot reasoning
multi-scale semantics
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Splatting
Semantic Mapping
Zero-shot Navigation
Multi-scale Semantics
Embodied Reasoning
Sixian Zhang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, Beijing
Yiyao Wang
State Key Lab of CAD&CG, Zhejiang University
visualization
Xinhang Song
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, Beijing
Keming Zhang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, Beijing
Zijian Xu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, Beijing
Shuqiang Jiang
Institute of Computing Technology, Chinese Academy of Sciences
Multimedia Analysis, Visual Understanding and Retrieval, Multimodal Intelligence