TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

📅 2024-12-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Task-oriented 3D scene understanding requires joint modeling of spatial hierarchy (room → region → object) and functional affordances, yet existing methods lack explicit coupling between the two. Method: We propose the 3D Hierarchical Scene Graph (3DHSG) framework, which jointly learns room classification, the definition of spatial regions, and region- and object-level affordances from segmented object point clouds and object semantic labels via a Transformer-based multi-task model. Contribution/Results: (1) We introduce a custom 3DHSG dataset with ground-truth region- and object-level affordance annotations; (2) we establish a hierarchical 3DHSG generation approach that unifies spatial structure and functional semantics; (3) our method improves on the performance of state-of-the-art baseline models. We publicly release both code and dataset.

📝 Abstract
The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we construct a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grandchild nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances, as well as object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG. We use a multi-task learning framework that jointly learns room classification and the definition of spatial regions within the room, together with their region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and to the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
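
The three-level hierarchy described in the abstract (room → region → object, with affordances attached at the region and object levels) can be pictured as a small tree. The following is a minimal sketch only, not the authors' data format: the class names, field names, and the kitchen example values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ObjectNode:
    """Grandchild node: an object with its location and object-specific affordances."""
    label: str                               # object semantic label (given as input)
    centroid: Tuple[float, float, float]     # assumed location summary of the object's point cloud
    affordances: List[str] = field(default_factory=list)


@dataclass
class RegionNode:
    """Child node: a local spatial region with region-specific affordances."""
    name: str
    affordances: List[str] = field(default_factory=list)
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top node: identifies the room label."""
    label: str
    regions: List[RegionNode] = field(default_factory=list)

    def walk(self) -> Iterator[Tuple[int, str]]:
        """Yield (depth, description) pairs in top-down order."""
        yield 0, f"room: {self.label}"
        for region in self.regions:
            yield 1, f"region: {region.name}"
            for obj in region.objects:
                yield 2, f"object: {obj.label}"


def build_example() -> RoomNode:
    """Build a toy graph; every label and affordance here is invented."""
    cooking = RegionNode(
        name="cooking area",
        affordances=["cook"],
        objects=[
            ObjectNode("stove", (0.5, 1.0, 0.9), ["heat food"]),
            ObjectNode("pot", (0.5, 1.0, 1.0), ["hold liquid"]),
        ],
    )
    dining = RegionNode(
        name="dining area",
        affordances=["eat"],
        objects=[ObjectNode("table", (2.0, 1.5, 0.7), ["place items"])],
    )
    return RoomNode(label="kitchen", regions=[cooking, dining])


if __name__ == "__main__":
    for depth, text in build_example().walk():
        print("  " * depth + text)
```

Keeping affordances on both `RegionNode` and `ObjectNode` mirrors the abstract's distinction between region-specific and object-specific affordances; how the actual dataset encodes them is not stated in this summary.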
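
Problem

Research questions and friction points this paper is trying to address.

How to construct a hierarchical 3D scene graph (room → region → object) from segmented object point clouds and object semantic labels
How to integrate functional affordance with the varying spatial context of the graph
How to apply transformer-based models to 3D scene understanding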

Innovation

Methods, ideas, or system contributions that make the work stand out.

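Hierarchical 3D scene graph construction with region- and object-specific affordances
Transformer-based model for learning the 3DHSG
Multi-task learning framework for room classification and spatial region definition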
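
To make the multi-task idea concrete, the sketch below combines a room-classification loss with a per-object region-assignment loss into a single objective. This is an assumption-laden illustration, not the paper's training objective: the summary does not state the loss, so the cross-entropy terms, the weights `w_room` and `w_region`, and all function names are hypothetical.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def multitask_loss(
    room_logits: Sequence[float],
    room_target: int,
    region_logits: Sequence[Sequence[float]],
    region_targets: Sequence[int],
    w_room: float = 1.0,      # hypothetical task weight
    w_region: float = 1.0,    # hypothetical task weight
) -> float:
    """Weighted sum of a room-level loss and a mean per-object region loss."""
    if not region_targets or len(region_logits) != len(region_targets):
        raise ValueError("need one region target per object, and at least one object")
    room_loss = cross_entropy(room_logits, room_target)
    region_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(region_logits, region_targets)
    ) / len(region_targets)
    return w_room * room_loss + w_region * region_loss


if __name__ == "__main__":
    # One room with two candidate labels, two objects with two candidate regions each.
    print(multitask_loss([2.0, 0.1], 0, [[1.5, 0.2], [0.3, 1.1]], [0, 1]))
```

Summing weighted per-task losses is one common way to train a shared model on several objectives at once; the relative weights would typically be tuned per dataset.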