TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

📅 2024-12-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Task-oriented 3D scene understanding requires jointly modeling spatial hierarchy (room → region → object) and functional affordances, yet existing methods lack explicit coupling between the two. Method: We propose the 3D Hierarchical Scene Graph (3DHSG) framework, which jointly learns room classification, region segmentation, and region- and object-level affordance prediction from segmented object point clouds and semantic labels via a transformer-based multi-task model. Contributions/Results: (1) a 3DHSG benchmark dataset with fine-grained region- and object-level affordance annotations; (2) a 3DHSG generation paradigm that unifies spatial structure and functional semantics; (3) significant improvements over state-of-the-art baselines across multiple metrics. The code and dataset are publicly released to advance functional and structured 3D scene understanding.

📝 Abstract
The concepts of function and affordance are critical aspects of 3D scene understanding and support task-oriented objectives. In this work, we develop a model that learns to vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene, so that the affordances attached to each node reflect that node's spatial context. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we build a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground-truth local spatial regions with region-specific affordances as well as object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG, using a multi-task learning framework that jointly performs room classification and defines spatial regions within the room with their region-specific affordances. Our method improves on the performance of state-of-the-art baseline models and demonstrates one approach for applying transformer models to 3D scene understanding and to the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
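To make the three-level structure described in the abstract concrete, below is a minimal sketch of how a 3DHSG could be represented in Python. The class and field names (RoomNode, RegionNode, ObjectNode, affordances) are illustrative assumptions for this listing, not the authors' released code or API.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class ObjectNode:
    """Grand-child node: one object with its location and object-specific affordances."""
    label: str             # object semantic label, e.g. "chair" (assumed vocabulary)
    points: np.ndarray     # segmented object point cloud, shape (N, 3)
    centroid: np.ndarray   # object location, shape (3,)
    affordances: List[str] # object-specific affordances, e.g. ["sit"]


@dataclass
class RegionNode:
    """Child node: a local spatial region inside the room with region-specific affordances."""
    region_id: int
    affordances: List[str]                   # region-specific affordances, e.g. ["work at desk"]
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top node: the room label and its child regions."""
    room_label: str                          # e.g. "office"
    regions: List[RegionNode] = field(default_factory=list)
```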
Problem

Research questions and friction points this paper is trying to address.

How to construct a hierarchical 3D scene graph that captures the spatial organization of a room
How to integrate region- and object-level functional affordances with the varying spatial context of that graph
How to apply transformer-based models to 3D scene understanding and 3DHSG generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical 3D scene graph (3DHSG) construction from segmented object point clouds and semantic labels
Transformer-based model for learning the 3DHSG
Multi-task learning framework that jointly performs room classification and region-specific affordance prediction (see the sketch after this list)
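As a rough illustration of the multi-task setup, the sketch below pairs a Transformer encoder over per-object feature tokens with a room-classification head and a per-object head for region/affordance prediction. This is an assumed PyTorch formulation based on the abstract, not the authors' implementation; all module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn


class HSGMultiTaskModel(nn.Module):
    """Illustrative multi-task model: a Transformer encoder over per-object
    feature tokens, one head for room classification, and one per-token head
    for region-specific affordance prediction. Dimensions are assumed."""

    def __init__(self, obj_feat_dim=256, d_model=256, num_rooms=10,
                 num_region_affordances=20, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Linear(obj_feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.room_head = nn.Linear(d_model, num_rooms)                  # one label per scene
        self.region_head = nn.Linear(d_model, num_region_affordances)  # one label per object

    def forward(self, obj_feats, padding_mask=None):
        # obj_feats: (batch, num_objects, obj_feat_dim) per-object features
        x = self.embed(obj_feats)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        room_logits = self.room_head(x.mean(dim=1))   # pooled scene representation
        region_logits = self.region_head(x)           # per-object region/affordance logits
        return room_logits, region_logits


# A joint training step would combine both objectives, e.g.:
#   loss = ce(room_logits, room_label) \
#        + ce(region_logits.transpose(1, 2), region_labels)
```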
Wenting Xu
School of Electrical and Computer Engineering, The University of Sydney

Viorela Ila
The University of Sydney
Robotics, Computer Vision

Luping Zhou
School of Electrical and Computer Engineering, The University of Sydney
Medical Imaging, Computer Vision, Machine Learning

Craig T. Jin
School of Electrical and Computer Engineering, The University of Sydney