3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in zero-shot object navigation where low-level perceptual errors hinder high-level decision-making. To this end, it introduces 3D Gaussian Splatting (3DGS) into the task for the first time, enabling the construction of a persistent visual memory. By integrating free-viewpoint rendering, active viewpoint switching, and structured Chain-of-Thought prompting, the approach enhances the spatial reasoning and target localization capabilities of vision-language models. The method fuses real-time object detection with trajectory-guided rendering, achieving state-of-the-art performance across multiple simulation benchmarks and transferring reliably to real-world experiments on a quadruped robot, significantly improving both navigation success rate and robustness.

📝 Abstract
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
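The navigation loop the abstract describes (incrementally growing a 3DGS memory, detector-based candidate filtering, frontier-view rendering, and VLM re-verification) can be sketched at a very high level. Everything below is a hypothetical stub with toy logic, assumed for illustration only: `GaussianMemory`, `run_detector`, `query_vlm`, and `navigate` are not the authors' API, and the string matching stands in for real rendering, detection, and VLM reasoning.

```python
# Minimal sketch of the pipeline described in the abstract. All names and
# interfaces here are illustrative placeholders, not the paper's code.

from dataclasses import dataclass, field


@dataclass
class GaussianMemory:
    """Stand-in for the incrementally built 3DGS scene representation."""
    frames: list = field(default_factory=list)

    def integrate(self, observation: str) -> None:
        # In the real system this would fuse a posed RGB-D frame into 3DGS.
        self.frames.append(observation)

    def render_frontier_views(self) -> list:
        # Stand-in for trajectory-guided free-viewpoint rendering of
        # frontier-aware first-person views (here: the last two frames).
        return [f"view_of_{obs}" for obs in self.frames[-2:]]


def run_detector(observation: str, target: str) -> bool:
    # Stand-in for the real-time object detector's first-pass filter.
    return target in observation


def query_vlm(views: list, target: str) -> bool:
    # Stand-in for structured visual prompts + CoT reasoning over
    # rendered views; returns a re-verification verdict.
    return any(target in view for view in views)


def navigate(observations: list, target: str, max_steps: int = 10) -> int:
    """Return the step at which the target is confirmed, or -1."""
    memory = GaussianMemory()
    for step, obs in enumerate(observations[:max_steps]):
        memory.integrate(obs)                       # grow persistent memory
        if run_detector(obs, target):               # cheap candidate filter
            views = memory.render_frontier_views()  # active viewpoint switch
            if query_vlm(views, target):            # VLM re-verification
                return step                         # target confirmed
    return -1


# navigate(["wall", "sofa chair", "door"], "chair") → confirms at step 1
```

The point of the two-stage check is the one the abstract makes: the fast detector proposes candidates, and the VLM only re-verifies over rendered memory views, so expensive reasoning runs rarely and on better viewpoints.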
Problem

Research questions and friction points this paper is trying to address.

object navigation
vision-language models
zero-shot learning
spatial reasoning
3D representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
Vision-Language Model
Zero-Shot Object Navigation
Active Perception
Chain-of-Thought Prompting
Wancai Zheng
Zhejiang University of Technology, Hangzhou, China
Hao Chen
Zhejiang University
Computer Science
Xianlong Lu
Zhejiang University of Technology, Hangzhou, China
Linlin Ou
Zhejiang University of Technology, Hangzhou, China
Xinyi Yu
Zhejiang University of Technology
Intelligent robotics, embodied intelligence