Learning Semantic-Geometric Task Graph Representations from Human Demonstrations

📅 2026-01-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges posed by long-horizon bimanual manipulation tasks, where action sequences, object participation, and interaction geometries exhibit high variability. To this end, the authors propose a semantic-geometric task graph representation that unifies the modeling of semantic task structure and the temporal evolution of geometric relationships among objects within a single graph framework. Leveraging message-passing neural networks (MPNNs), the approach learns structured representations from temporal scene graphs and employs a Transformer decoder to predict future action sequences, associated objects, and motion trajectories conditioned on action context. The proposed framework decouples scene representation from action reasoning, enabling transferable high-level task abstraction. It significantly outperforms sequential models in highly variable tasks and has been successfully deployed on a physical bimanual robot for online action selection.

๐Ÿ“ Abstract
Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
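To make the encoder side of this pipeline concrete, the sketch below shows one message-passing round over an object-centric scene graph, where nodes carry object features and edges carry geometric relation features. All names, the feature scheme, and the sum-aggregation/residual update are illustrative assumptions, not the paper's actual implementation.

```python
def mpnn_round(node_feats, edges, edge_feats):
    """One synchronous message-passing update over a scene graph.

    node_feats: {node_id: [float, ...]}  object-centric features
    edges:      list of (src, dst)        directed semantic relations
    edge_feats: {(src, dst): [float, ...]} geometric relation features

    Hypothetical scheme: each edge sends (source features + edge features)
    to its destination; messages are sum-aggregated, then combined with
    the old node state via a residual update.
    """
    messages = {n: [0.0] * len(f) for n, f in node_feats.items()}
    for src, dst in edges:
        msg = [s + e for s, e in zip(node_feats[src], edge_feats[(src, dst)])]
        messages[dst] = [m + v for m, v in zip(messages[dst], msg)]
    # Residual update: new state = old state + aggregated incoming messages.
    return {n: [h + m for h, m in zip(node_feats[n], messages[n])]
            for n in node_feats}


# Toy two-object scene: a cup resting on a table, with one relation edge.
nodes = {"cup": [1.0, 0.0], "table": [0.0, 1.0]}
edges = [("cup", "table")]
edge_feats = {("cup", "table"): [0.5, 0.5]}
updated = mpnn_round(nodes, edges, edge_feats)
```

In the full framework, stacking several such rounds over a sequence of temporal scene graphs would produce the structured representations the Transformer decoder conditions on; real implementations would use learned message and update networks (e.g. MLPs) in place of the fixed sums above.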
Problem

Research questions and friction points this paper is trying to address.

task representation
semantic-geometric reasoning
human demonstration
bimanual manipulation
structured learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-geometric task graph
message passing neural network
Transformer-based decoder
bimanual manipulation
task representation learning