AI Summary
Problem: High-level natural language instructions are hard to ground in 3D scenes because they are abstract and often do not explicitly refer to semantic elements of the scene.
Method: This paper introduces the first scene-graph-driven automatic hierarchical task analysis framework. It jointly optimizes environment-dependent task decomposition and scene representation via alternating iterations between large language models (LLMs) and task-oriented 3D scene graph construction. A graph neural network (GNN) is employed for task-driven relational modeling of 3D scenes, and an iterative task-scene co-optimization mechanism is designed to overcome LLMs' inherent limitations in spatial reasoning.
Contribution/Results: Experiments show that the method significantly outperforms pure-LLM baselines at decomposing high-level tasks into environment-dependent subtasks, and that it achieves subtask grounding performance in 3D scenes comparable to state-of-the-art methods.
Abstract
While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
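The alternation described in the abstract can be illustrated with a toy loop. This is only a sketch of the general idea (decompose the task given the current scene representation, then rebuild the representation around the resulting subtasks, and repeat until neither changes); it is not ASHiTA's actual algorithm, which the abstract does not specify. All names here (`alternate`, `toy_decompose`, `toy_relevant`) are hypothetical, the "scene graph" is reduced to a set of object labels, and the LLM is replaced by a hard-coded rule table.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical signatures: a decomposer maps (task, visible objects) to subtasks,
# and a relevance function maps (subtask, all objects) to the objects it needs.
Decompose = Callable[[str, Set[str]], List[str]]
Relevant = Callable[[str, Set[str]], Set[str]]


def alternate(task: str, all_objects: Set[str], decompose: Decompose,
              relevant: Relevant, max_iters: int = 5) -> Tuple[List[str], Set[str]]:
    """Alternate task breakdown and scene pruning until both stop changing."""
    kept: Set[str] = set(all_objects)  # start from the full scene
    subtasks: List[str] = []
    for _ in range(max_iters):
        # Step 1: break the task down given the current scene representation.
        new_subtasks = decompose(task, kept)
        # Step 2: rebuild the scene representation around the new subtasks.
        new_kept: Set[str] = set()
        for sub in new_subtasks:
            new_kept |= relevant(sub, all_objects)
        if new_subtasks == subtasks and new_kept == kept:
            break  # fixed point reached
        subtasks, kept = new_subtasks, new_kept
    return subtasks, kept


def toy_decompose(task: str, objects: Set[str]) -> List[str]:
    """Stand-in for an LLM call: emits only subtasks the scene can support."""
    subtasks = []
    if "mug" in objects and "sink" in objects:
        subtasks.append("put mug in sink")
    if "trash can" in objects:
        subtasks.append("empty trash can")
    return subtasks


def toy_relevant(subtask: str, objects: Set[str]) -> Set[str]:
    """Keep the objects whose label is mentioned in the subtask text."""
    return {obj for obj in objects if obj in subtask}


scene = {"mug", "sink", "sofa", "trash can"}
subtasks, kept = alternate("tidy the kitchen", scene, toy_decompose, toy_relevant)
print(subtasks)
print(sorted(kept))
```

In this toy run the sofa is dropped from the representation because no subtask refers to it, which mirrors the abstract's point that both the task breakdown and the useful scene representation depend on the environment.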