ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Date: 2025-04-09
Citations: 0
Influential: 0
AI Summary
Problem: High-level natural language instructions are hard to ground in 3D scenes because they are abstract and lack explicit spatial constraints. Method: This paper introduces ASHiTA, which it presents as the first framework for automatic hierarchical task analysis grounded to a 3D scene graph. It alternates LLM-assisted, environment-dependent task decomposition with task-driven 3D scene graph construction, so that the task breakdown and the scene representation refine each other. A graph neural network (GNN) is employed for task-driven relational modeling of 3D scenes, and the iterative task-scene co-optimization is designed to compensate for LLMs' limitations in spatial reasoning. Contribution/Results: Experiments show that the method significantly outperforms LLM baselines at breaking high-level tasks into environment-dependent subtasks and achieves subtask grounding performance comparable to state-of-the-art methods.

Abstract
While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Grounding abstract high-level instructions to 3D scenes
Breaking high-level tasks into environment-dependent subtasks
Generating task hierarchies grounded in 3D scene graphs
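
To make "a task hierarchy grounded in a 3D scene graph" concrete, here is a minimal Python sketch of one possible data structure. It is an illustration only: the class `TaskNode`, its fields, and the object ids such as `kettle_1` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """One node in a task hierarchy (hypothetical structure, not the paper's)."""
    description: str
    # Ids of scene-graph nodes this task refers to (assumed naming scheme).
    grounded_objects: List[str] = field(default_factory=list)
    subtasks: List["TaskNode"] = field(default_factory=list)

    def leaves(self) -> List["TaskNode"]:
        """Return the most concrete subtasks in depth-first order."""
        if not self.subtasks:
            return [self]
        found = []
        for child in self.subtasks:
            found.extend(child.leaves())
        return found

    def is_grounded(self) -> bool:
        """Treat the hierarchy as grounded when every leaf names an object."""
        return all(leaf.grounded_objects for leaf in self.leaves())


# A high-level task broken into subtasks, each tied to scene objects.
root = TaskNode(
    "prepare breakfast",
    subtasks=[
        TaskNode("make coffee", subtasks=[
            TaskNode("fill the kettle", grounded_objects=["kettle_1", "sink_1"]),
            TaskNode("get a mug", grounded_objects=["mug_3"]),
        ]),
        TaskNode("toast bread", grounded_objects=["toaster_1"]),
    ],
)
```

In this sketch only the leaves carry object references, which mirrors the idea that the concrete subtasks, not the abstract instruction, are what get grounded.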
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates a task hierarchy grounded to a 3D scene graph
Alternates LLM-assisted hierarchical task analysis with task-driven 3D scene graph construction
Breaks high-level tasks into grounded, environment-dependent subtasks
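
The alternation between task analysis and scene graph construction can be sketched as a simple fixed-point loop. This is a toy illustration under strong assumptions, not ASHiTA's implementation: `propose_subtasks` replaces the LLM with a hard-coded lookup table, `build_scene_graph` replaces scene graph construction with a filter over a dictionary of detections, and all names and data are hypothetical.

```python
def propose_subtasks(task, visible_objects):
    """Stand-in for the LLM step: keep subtasks whose object is in the scene."""
    # Hypothetical lookup table used in place of a real LLM call.
    candidates = {
        "make coffee": [
            ("fill the kettle", "kettle"),
            ("grind beans", "grinder"),
            ("get a mug", "mug"),
        ],
    }
    return [(s, o) for s, o in candidates.get(task, []) if o in visible_objects]


def build_scene_graph(detections, subtasks):
    """Stand-in for task-driven graph construction: keep task-relevant objects."""
    relevant = {obj for _, obj in subtasks}
    return {obj: room for obj, room in detections.items() if obj in relevant}


def alternate(task, detections, max_iters=5):
    """Alternate the two steps until neither output changes."""
    graph = dict(detections)  # start from every detected object
    subtasks = []
    for _ in range(max_iters):
        new_subtasks = propose_subtasks(task, set(graph))
        new_graph = build_scene_graph(detections, new_subtasks)
        if new_subtasks == subtasks and new_graph == graph:
            break  # fixed point reached
        subtasks, graph = new_subtasks, new_graph
    return subtasks, graph


# Toy scene: object name -> room. There is no grinder, so that subtask is dropped.
detections = {"kettle": "kitchen", "mug": "kitchen", "sofa": "living room"}
subtasks, graph = alternate("make coffee", detections)
```

The point of the sketch is the dependency in both directions: the task breakdown depends on what the scene contains, and the scene representation keeps only what the breakdown needs.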