HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions

📅 2025-06-24
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
Existing indoor scene understanding datasets lack fine-grained modeling of human involvement, particularly human-object interactions, hindering embodied agents' ability to recognize actions and relational semantics during navigation and planning. To address this, we introduce the first synthetic dataset integrating scene graphs with parametric human-object interaction modeling. It defines dense, unambiguous spatial and functional relations (e.g., “hand grasping cup handle”) and provides multimodal annotations: RGB, depth, instance segmentation, and 3D human keypoints. By explicitly parameterizing structural relationships among objects and between humans and objects, the dataset bridges a critical gap in human-centric scene understanding. We benchmark state-of-the-art scene graph generation models on the dataset for parametric relation and human-object interaction prediction, establishing a new benchmark and methodological foundation for fine-grained, embodied scene understanding.
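To make the annotation format concrete, here is a minimal sketch of what a per-frame record with a parametric human-object relation could look like. The field names and structure are hypothetical illustrations derived from the modalities listed above, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A scene entity: an object instance or a human body part."""
    node_id: int
    label: str    # e.g. "hand", "cup_handle"
    bbox: tuple   # (x, y, w, h) in image coordinates

@dataclass
class ParametricRelation:
    """A relation whose stored parameters keep it unambiguous."""
    subject: int        # node_id of the subject (e.g. the hand)
    predicate: str      # e.g. "grasping"
    target: int         # node_id of the object (e.g. the cup handle)
    distance_m: float   # metric distance between the two entities

# Hypothetical per-frame record bundling the annotation modalities.
frame = {
    "rgb": "frame_0001_rgb.png",
    "depth": "frame_0001_depth.png",
    "instance_seg": "frame_0001_seg.png",
    "keypoints_3d": "frame_0001_kpts.json",
    "nodes": [Node(0, "hand", (412, 230, 60, 55)),
              Node(1, "cup_handle", (430, 250, 25, 30))],
    "relations": [ParametricRelation(0, "grasping", 1, 0.02)],
}
print(frame["relations"][0])
```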

📝 Abstract
When humans and robotic agents coexist in an environment, scene understanding becomes crucial for the agents to carry out downstream tasks like navigation and planning. Hence, an agent must be capable of localizing and identifying actions performed by the human. Current research lacks reliable datasets for scene understanding in indoor environments where humans are also part of the scene. Scene graphs enable us to generate a structured representation of a scene or an image for visual scene understanding. To tackle this, we present HOIverse, a synthetic dataset at the intersection of scene graphs and human-object interaction, consisting of accurate and dense relationship ground truth between humans and surrounding objects, along with corresponding RGB images, segmentation masks, depth images, and human keypoints. We compute parametric relations between various object pairs and human-object pairs, resulting in accurate and unambiguous relation definitions. In addition, we benchmark our dataset on state-of-the-art scene graph generation models to predict parametric relations and human-object interactions. Through this dataset, we aim to accelerate research in the field of scene understanding involving people.
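The abstract's “parametric relations” can be read as relations derived from, and stored with, continuous geometric parameters rather than bare labels. The sketch below is one plausible reading, not the paper's actual definition: it computes a metric distance and a horizontal angle between two 3D centroids and thresholds them into a label; the threshold values and label names are assumptions.

```python
import math

def parametric_relation(a, b, near_thresh=0.5):
    """Classify the spatial relation of point b as seen from point a.

    a, b: (x, y, z) centroids in metres, camera coordinates
    (x right, y down, z forward). Returns the label together with
    the raw parameters so the relation stays unambiguous.
    """
    dx, dy, dz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.degrees(math.atan2(dx, dz))  # horizontal bearing of b from a
    if dist < near_thresh:
        label = "near"
    elif -45.0 <= azimuth <= 45.0:
        label = "in front of"
    elif azimuth > 45.0:
        label = "right of"
    else:
        label = "left of"
    return {"label": label, "distance_m": dist, "azimuth_deg": azimuth}

# Example: an object 0.45 m away is classified as "near".
print(parametric_relation((0.0, 0.0, 1.0), (0.4, 0.0, 1.2)))
```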
Problem

Research questions and friction points this paper is trying to address.

Lack reliable datasets for indoor scene understanding with humans
Need structured scene representation for human-object interaction analysis
Require accurate relation definitions between humans and surrounding objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset combining scene graphs and human-object interactions
Parametric relations for accurate and unambiguous definitions
Benchmarked on state-of-the-art scene graph generation models (metric sketch below)
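Scene graph generation benchmarks conventionally report Recall@K, the fraction of ground-truth (subject, predicate, object) triplets recovered among a model's top-K scored predictions. This page does not state the paper's exact metric, so the sketch below is a generic illustration of that convention.

```python
def recall_at_k(gt_triplets, scored_preds, k=50):
    """Recall@K for relation prediction.

    gt_triplets: set of (subject_id, predicate, object_id)
    scored_preds: list of ((subject_id, predicate, object_id), score)
    """
    if not gt_triplets:
        return 0.0
    ranked = sorted(scored_preds, key=lambda p: p[1], reverse=True)
    top_k = {triplet for triplet, _ in ranked[:k]}
    return len(gt_triplets & top_k) / len(gt_triplets)

# Toy check: one of two ground-truth relations appears in the top 2.
gt = {(0, "grasping", 1), (2, "on", 3)}
preds = [((0, "grasping", 1), 0.9), ((2, "near", 3), 0.7), ((2, "on", 3), 0.1)]
print(recall_at_k(gt, preds, k=2))  # 0.5
```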
Mrunmai Vivek Phatak
Machine Learning and Computer Vision Lab, UniversitÀt Augsburg, Augsburg, Germany
Julian Lorenz
University of Augsburg
Computer Vision
Nico Hörmann
Machine Learning and Computer Vision Lab, UniversitÀt Augsburg, Augsburg, Germany
Jörg HÀhner
UniversitÀt Augsburg
Organic Computing, distributed systems, computer networks, machine learning
Rainer Lienhart
Professor of Computer Science, University of Augsburg
Machine Learning, Computer Vision, Multimedia Computing