Task-oriented Sequential Grounding and Navigation in 3D Scenes

📅 2024-08-07
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of multi-step localization and navigation guided by task-oriented language instructions in real-world indoor 3D environments, this paper introduces the first sequential 3D vision-language grounding task. We present SG3D, a large-scale, multi-step, human-verified dataset comprising 22K tasks and 112K steps, capturing fine-grained action–target temporal relations in daily activities. We propose SG-LLM, a stepwise grounding framework that jointly leverages RGB-D scene representations and the incremental reasoning capabilities of large language models to achieve dynamic, context-aware vision-language alignment. Comprehensive evaluation on the SG3D benchmark reveals that existing methods suffer from limited multi-step contextual modeling, whereas SG-LLM achieves substantial improvements—+18.7% in sequential grounding accuracy and +22.3% in navigation success rate. This work establishes a new paradigm for task-level semantic understanding and execution in embodied agents.

Technology Category

Application Category

📝 Abstract
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing the significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach leveraging a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to advance the development of more capable and context-aware embodied agents.
Problem

Research questions and friction points this paper is trying to address.

Dynamic sequential grounding in 3D environments
Task-oriented navigation using step-by-step instructions
Large-scale dataset for 3D scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Task-oriented Sequential Grounding and Navigation
Develops SG3D dataset with 22,346 tasks
Proposes SG-LLM for stepwise grounding in 3D
🔎 Similar Papers
No similar papers found.
Z
Zhuofan Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI, China; Tsinghua University
Z
Ziyu Zhu
State Key Laboratory of General Artificial Intelligence, BIGAI, China; Tsinghua University
Pengxiang Li
Pengxiang Li
Beijing Institute of Technology
Multimodal AgentVision and Language3DVHyperbolic Learning
Tengyu Liu
Tengyu Liu
Beijing Institute for General Artificial Intelligence
computer visionhuman object interactionhuman motion generationgrasping
Xiaojian Ma
Xiaojian Ma
University of California, Los Angeles
Computer VisionMachine LearningGenerative ModelingReinforcement Learning
Y
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Baoxiong Jia
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer VisionArtificial Intelligence
S
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Q
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China