AI Summary
Service robots lack social situational understanding, hindering their ability to interpret human attributes, activities, and interpersonal or human-object interactions in 3D environments. Method: We propose Social 3D Scene Graphs (S3SG), an enhanced 3D scene graph representation supporting open-vocabulary semantics and multi-scale relational modeling. S3SG is the first to jointly encode human attributes, activities, and both local and long-range human-human and human-object interactions within a unified 3D spatial framework. It integrates multi-frame temporal information and synthetic data generation to construct SocialScene3D, the first large-scale synthetic benchmark for complex social reasoning, featuring fine-grained behavioral and relational annotations. Contribution/Results: Experiments demonstrate that S3SG significantly improves human activity prediction and social relation inference, achieving state-of-the-art performance on multiple open-vocabulary social query tasks. This work establishes a scalable, structured cognitive foundation for socially intelligent robots.
Abstract
Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, in part due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture only open-vocabulary relations from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, their activities, and their relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. Our experiments demonstrate that this representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.
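To make the representation concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a social 3D scene graph: typed nodes for humans and objects carrying 3D positions and open-vocabulary attributes, and relation edges that can link a person to nearby objects (local interactions) or to distant ones (remote interactions, e.g. watching a TV across the room). All class and relation names here are illustrative assumptions.

```python
# Illustrative sketch of a social 3D scene graph; names and structure
# are assumptions, not the paper's actual data model.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str                              # e.g. "human" or "object"
    position: tuple                        # 3D centroid (x, y, z)
    attributes: dict = field(default_factory=dict)  # open-vocabulary attributes

@dataclass
class Edge:
    source: str
    target: str
    relation: str                          # open-vocabulary relation phrase

class SocialSceneGraph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str) -> None:
        self.edges.append(Edge(source, target, relation))

    def relations_of(self, node_id: str) -> list[Edge]:
        # All relation edges in which the node participates, either direction.
        return [e for e in self.edges if node_id in (e.source, e.target)]

# Build a toy scene: a person sitting on a sofa (local) while watching
# a TV several meters away (remote / long-range interaction).
g = SocialSceneGraph()
g.add_node(Node("human_1", "human", (1.0, 0.5, 0.0), {"activity": "watching TV"}))
g.add_node(Node("sofa_1", "object", (1.2, 0.4, 0.0)))
g.add_node(Node("tv_1", "object", (4.0, 0.6, 1.5)))
g.add_edge("human_1", "sofa_1", "sitting on")   # local human-object interaction
g.add_edge("human_1", "tv_1", "watching")       # remote human-object interaction

print([e.relation for e in g.relations_of("human_1")])  # -> ['sitting on', 'watching']
```

A graph like this can be queried with free-form relation phrases rather than a fixed label set, which is the sense in which the representation is "open-vocabulary".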