Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D vision-language models are limited to static object localization and lack active exploration capabilities. This paper proposes an online active perception framework for embodied agents, unifying visual grounding and exploration decision-making for the first time. The approach constructs spatial memory via online query-based representation learning, eliminating explicit 3D reconstruction, and introduces an end-to-end trajectory learning paradigm that processes RGB-D inputs and leverages vision-language-exploration multimodal pre-training. Navigation policies are optimized using over one million simulated and real-world trajectory samples. Evaluated on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, the method achieves absolute improvements of 14%, 23%, 9%, and 2% in task success rate, respectively, significantly enhancing generalizable multimodal navigation in complex, dynamic environments.

📝 Abstract
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, which constructs spatial memory directly from RGB-D frames and eliminates the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning with Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
Problem

Research questions and friction points this paper is trying to address.

Existing 3D-VL models ground objects only in static reconstructions (meshes, point clouds)
Embodied agents lack the ability to actively perceive and explore their environment
Visual grounding and exploration decision-making have been treated as separate problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online query-based representation learning from RGB-D frames
Unified objective for grounding and exploring via frontier queries
End-to-end Vision-Language-Exploration pre-training on trajectories
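The unified objective in the second bullet can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: it assumes object queries (for observed instances) and frontier queries (for unexplored locations) live in a shared embedding space, scores both kinds against an instruction embedding, and returns the best candidate, so one decision either grounds a seen object or selects a frontier to explore.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_goal(instruction_emb, object_queries, frontier_queries):
    """Toy version of a unified grounding/exploration decision:
    score object and frontier queries jointly against the instruction
    and return the highest-scoring candidate as (action, index, score)."""
    candidates = [("ground", i, cosine(instruction_emb, q))
                  for i, q in enumerate(object_queries)]
    candidates += [("explore", i, cosine(instruction_emb, q))
                   for i, q in enumerate(frontier_queries)]
    return max(candidates, key=lambda c: c[2])

# Toy embeddings (hypothetical values): the second object query
# matches the instruction far better than the single frontier query.
instruction = np.array([1.0, 0.0, 0.0])
objects = [np.array([0.0, 1.0, 0.0]), np.array([0.9, 0.1, 0.0])]
frontiers = [np.array([0.3, 0.3, 0.9])]

action, idx, score = select_goal(instruction, objects, frontiers)
print(action, idx)  # → ground 1
```

If no object query matched well (e.g. the target has not been observed yet), the frontier candidate would win and the agent would explore instead of navigating to a grounded object, which is the behavior the unified objective is meant to enable.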
Ziyu Zhu
Tsinghua University, State Key Laboratory of General Artificial Intelligence, BIGAI, China
Xilin Wang
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Yixuan Li
Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI, China
Zhuofan Zhang
Tsinghua University, State Key Laboratory of General Artificial Intelligence, BIGAI, China
Xiaojian Ma
University of California, Los Angeles
Computer Vision, Machine Learning, Generative Modeling, Reinforcement Learning
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision, Artificial Intelligence
Wei Liang
Beijing Institute of Technology
Qian Yu
Professor, Dept of Earth, Geographic, and Climate Sciences, University of Massachusetts-Amherst
GIS, Remote Sensing, Spatial Modeling
Zhidong Deng
Professor of Computer Science, Tsinghua University
Artificial Intelligence, Self-driving, Robotics, IoT, Computational Biology
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China