🤖 AI Summary
To address the challenge of multi-step localization and navigation guided by task-oriented language instructions in real-world indoor 3D environments, this paper introduces the first sequential 3D vision-language grounding task. We present SG3D, a large-scale, multi-step, human-verified dataset comprising 22K tasks and 112K steps that captures fine-grained action–target temporal relations in daily activities. We propose SG-LLM, a stepwise grounding framework that jointly leverages RGB-D scene representations and the incremental reasoning capabilities of large language models to achieve dynamic, context-aware vision-language alignment. Comprehensive evaluation on the SG3D benchmark reveals that existing methods struggle with multi-step contextual modeling, whereas SG-LLM achieves substantial improvements: +18.7% in sequential grounding accuracy and +22.3% in navigation success rate. This work establishes a new paradigm for task-level semantic understanding and execution in embodied agents.
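As a rough illustration of the stepwise paradigm described above, the sketch below grounds one step at a time while carrying forward the instructions and predictions of earlier steps as context. The function names and interfaces (`ground_task`, `model.ground`) are assumptions for illustration, not SG-LLM's actual API.

```python
# Minimal sketch of stepwise sequential grounding (all interfaces here are
# hypothetical, not SG-LLM's published API). At each step the model sees the
# 3D scene, the current instruction, and the accumulated history of earlier
# instructions and grounded objects, so that task-level context informs every
# localization decision.
def ground_task(model, scene, instructions):
    history = []      # (instruction, predicted object id) pairs from prior steps
    predictions = []
    for instruction in instructions:
        target_id = model.ground(scene, instruction, history)  # one object per step
        history.append((instruction, target_id))
        predictions.append(target_id)
    return predictions
```

The design choice this highlights is that grounding is conditioned on the running history rather than on each instruction in isolation, which is what distinguishes the sequential task from conventional single-query 3D visual grounding.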
📝 Abstract
Grounding natural language in 3D environments is a critical step toward robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions, and do not adequately address the dynamic, sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach that leverages a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to develop more capable, context-aware embodied agents.
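To make the task format concrete, here is a minimal sketch of what a single SG3D-style task record and a per-step accuracy check might look like. The field names (`scene_id`, `steps`, `instruction`, `target_id`), the example activity, and the metric are illustrative assumptions, not the dataset's published schema or official evaluation protocol.

```python
# A hypothetical SG3D-style task record: one daily activity in a real-world
# scene, decomposed into ordered steps that each pair a language instruction
# with a target object. Field names are illustrative assumptions only.
example_task = {
    "scene_id": "scene0000_00",       # one of the 4,895 real-world 3D scenes
    "task": "Make a cup of coffee.",  # high-level daily activity
    "steps": [
        {"instruction": "Walk to the kitchen counter.", "target_id": 12},
        {"instruction": "Pick up the mug next to the sink.", "target_id": 47},
        {"instruction": "Place the mug under the coffee machine.", "target_id": 31},
    ],
}

def step_accuracy(predicted_ids, task):
    """Fraction of steps whose predicted object matches the annotated target.
    A stricter task-level variant would count a task as correct only if
    every one of its steps is grounded correctly."""
    steps = task["steps"]
    hits = sum(p == s["target_id"] for p, s in zip(predicted_ids, steps))
    return hits / len(steps)
```

This layout reflects the paper's core annotation unit: a task is an ordered list of steps, so any evaluation must account for correctness across the whole sequence rather than per isolated query.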