Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Service robots lack social situational understanding, hindering their ability to interpret human attributes, activities, and interpersonal or human-object interactions in 3D environments.

Method: We propose Social 3D Scene Graphs (S3SG), an enhanced 3D scene graph representation supporting open-vocabulary semantics and multi-scale relational modeling. S3SG is the first to jointly encode human attributes, activities, and both local and long-range human-human and human-object interactions within a unified 3D spatial framework. It integrates multi-frame temporal information and synthetic data generation to construct SocialScene3D, the first large-scale synthetic benchmark for complex social reasoning, featuring fine-grained behavioral and relational annotations.

Contribution/Results: Experiments demonstrate that S3SG significantly improves human activity prediction and social relation inference, achieving state-of-the-art performance on multiple open-vocabulary social query tasks. This work establishes a scalable, structured cognitive foundation for socially intelligent robots.

πŸ“ Abstract
Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, in part due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture open-vocabulary relations only from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities, and relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. The experiments demonstrate that our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.
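The paper does not release its representation here; as a rough illustration of what a human-augmented, open-vocabulary 3D scene graph might look like, the sketch below models humans and objects as nodes with 3D positions and free-text attributes, and both local ("sitting on") and long-range ("watching") interactions as labeled edges. All class, field, and relation names are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    category: str             # e.g. "human", "object", "room"
    position: tuple           # 3D centroid (x, y, z)
    attributes: dict = field(default_factory=dict)  # open-vocabulary attributes

@dataclass
class Edge:
    source: str
    target: str
    relation: str             # open-vocabulary relation label

class SocialSceneGraph:
    """Minimal human-aware 3D scene graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}       # node_id -> Node
        self.edges = []       # list of Edge

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source, target, relation):
        self.edges.append(Edge(source, target, relation))

    def relations_for(self, node_id):
        """Return all (relation, other_node_id) pairs touching node_id."""
        out = []
        for e in self.edges:
            if e.source == node_id:
                out.append((e.relation, e.target))
            elif e.target == node_id:
                out.append((e.relation, e.source))
        return out

# Example scene: a person sitting on a sofa while watching a distant TV.
g = SocialSceneGraph()
g.add_node(Node("person_1", "human", (1.0, 0.5, 0.0),
                {"activity": "watching TV", "age_group": "adult"}))
g.add_node(Node("sofa_1", "object", (1.2, 0.5, 0.0)))
g.add_node(Node("tv_1", "object", (4.0, 0.5, 1.0)))
g.add_edge("person_1", "sofa_1", "sitting on")   # local interaction
g.add_edge("person_1", "tv_1", "watching")       # long-range interaction
```

A structure like this is what makes open-vocabulary social queries ("who is watching the TV?") answerable by graph traversal rather than per-frame perception alone.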
Problem

Research questions and friction points this paper is trying to address.

Modeling human actions and relations for interactive service robots
Capturing long-range human-environment interactions beyond single frames
Addressing lack of annotated human-scene relationship data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmented 3D Scene Graphs modeling human attributes and relationships
Open-vocabulary framework capturing local and remote human interactions
Synthetic benchmark with comprehensive human-scene relationship annotations
πŸ”Ž Similar Papers
No similar papers found.
Ermanno Bartoli
Faculty of Robotics Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden
Dennis Rotondi
Socially Intelligent Robotics Lab, Institute for Artificial Intelligence, University of Stuttgart, Germany
Buwei He
Faculty of Robotics Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden
Patric Jensfelt
KTH Royal Institute of Technology
robotics
Kai O. Arras
Professor of Autonomous Systems
Robotics · Social Robotics · Human-Robot Interaction · Artificial Intelligence · Computer Vision
Iolanda Leite
Associate Professor at KTH Royal Institute of Technology
Human-Robot Interaction · Artificial Intelligence · Social Robotics · Multimodal Interaction