Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning

📅 2026-05-03

🤖 AI Summary

Existing semantic mapping approaches trade off explicit geometry against multi-scale semantics and lack a native interface to large language models. This work proposes the multi-scale Gaussian-Language Map (GLMap), a unified map that combines explicit geometry with instance-level and region-level semantics and pairs each semantic unit with both a natural language description and a 3D Gaussian representation. Key contributions are the dual-modality semantic units, a Gaussian Estimator that derives Gaussian parameters analytically from dense point clouds without gradient-based optimization, and efficient incremental map construction, with Gaussian splatting used to render task-relevant images. In experiments on ObjectNav, InstNav, and SQA tasks, GLMap improves target navigation and contextual reasoning while remaining zero-shot compatible with large-model-based methods.
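To make the dual-modality idea concrete, here is a minimal sketch of what one semantic unit could look like as a data structure. It is not taken from the GLMap code: the names `Gaussian3D`, `SemanticUnit`, `member_ids`, and `find_units` are hypothetical, and the keyword match is only a stand-in for however a language model would actually query the text side of the map.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class Gaussian3D:
    """Geometry side of a unit (hypothetical layout): mean and 3x3 covariance."""
    mean: Vec3
    covariance: Mat3


@dataclass
class SemanticUnit:
    """One map entry holding both modalities: text and 3D Gaussians."""
    unit_id: int
    scale: str                      # "instance" or "region"
    description: str                # natural language side
    gaussians: List[Gaussian3D] = field(default_factory=list)
    member_ids: List[int] = field(default_factory=list)  # instances inside a region


def find_units(units: List[SemanticUnit], keyword: str,
               scale: Optional[str] = None) -> List[SemanticUnit]:
    """Return units whose description mentions the keyword (case-insensitive).

    Assumption: a plain substring match stands in for a language-model query.
    """
    key = keyword.lower()
    return [u for u in units
            if key in u.description.lower()
            and (scale is None or u.scale == scale)]


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

mug = SemanticUnit(0, "instance", "a red mug on the kitchen counter",
                   gaussians=[Gaussian3D((1.0, 0.9, 2.0), IDENTITY)])
kitchen = SemanticUnit(1, "region", "kitchen with a counter and a sink",
                       member_ids=[0])
demo_units = [mug, kitchen]
```

Keeping the description and the Gaussians in the same record is what lets a text query return geometry directly, without a learned feature projection between the two.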

📝 Abstract

Understanding the geometric and semantic structure of environments is essential for embodied navigation and reasoning. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target navigation and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner. The code is available at https://github.com/sx-zhang/GLMap.
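The abstract does not spell out how the Gaussian Estimator works, so the following is only an illustration of what "analytically deriving Gaussian parameters" can mean in its simplest form: the closed-form sample mean and sample covariance of a point cluster, computed in one pass with no gradient steps. The function name `estimate_gaussian` and the choice of the unbiased (n - 1) covariance are assumptions for this sketch, not the paper's method.

```python
from typing import List, Sequence, Tuple


def estimate_gaussian(
    points: Sequence[Sequence[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Closed-form fit of one 3D Gaussian to a point cluster.

    Returns (mean, covariance) using the sample mean and the unbiased
    sample covariance. No iterative optimization is involved.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")

    # Sample mean along each axis.
    mean = [sum(p[i] for p in points) / n for i in range(3)]

    # Accumulate outer products of deviations from the mean.
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j]

    # Normalize by (n - 1) for the unbiased estimate.
    cov = [[c / (n - 1) for c in row] for row in cov]
    return mean, cov


# Four corners of a 2 x 2 square lying in the z = 0 plane.
square = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
demo_mean, demo_cov = estimate_gaussian(square)
```

Splatting renderers typically factor the covariance into a rotation and per-axis scales, for example through an eigendecomposition. That step is left out here to keep the sketch free of external dependencies.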

Problem

Research questions and friction points this paper is trying to address.

semantic mapping

embodied navigation

zero-shot reasoning

multi-scale semantics

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian splatting

semantic mapping

zero-shot navigation