🤖 AI Summary
Existing 3D scene graph generation methods are constrained by single-view inputs, a lack of geometric grounding, and an inability to support incremental updates, limitations that hinder real-world embodied-intelligence applications. This paper introduces ZING-3D, a zero-shot, incrementally updatable, open-vocabulary 3D scene graph generation framework. Without fine-tuning, it combines the semantic understanding of pretrained vision-language models with geometric grounding from depth maps, projecting 2D scene graphs into 3D space. Nodes encode both object semantics and 3D spatial coordinates, while edges explicitly represent spatial and semantic relationships. Evaluated on the Replica and HM3D datasets, the method improves zero-shot inference of object relations and structural understanding in complex 3D environments, offering a path toward real-time, adaptive scene understanding in robotic systems.
📝 Abstract
Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the broad knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner, while also supporting incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach uses VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations along with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D effectively captures spatial and relational knowledge without the need for task-specific training.
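The grounding step described in the abstract, lifting 2D detections into 3D with depth and linking them by inter-object distance, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pinhole intrinsics, the detection format `(label, u, v, depth)`, and the helper names are assumptions introduced here.

```python
import math

# Hypothetical pinhole camera intrinsics; real values would come
# from the dataset's camera calibration (an assumption, not from the paper).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def backproject(u, v, depth):
    """Lift a 2D pixel (u, v) with metric depth into camera-frame 3D."""
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    return (x, y, depth)

def build_scene_graph(detections):
    """detections: list of (label, u, v, depth) tuples, e.g. from a
    VLM-generated 2D scene graph paired with a depth map.
    Returns nodes {id: (label, xyz)} and edges [(i, j, distance)]."""
    nodes = {i: (label, backproject(u, v, d))
             for i, (label, u, v, d) in enumerate(detections)}
    edges = []
    for i in nodes:
        for j in nodes:
            if i < j:
                pi, pj = nodes[i][1], nodes[j][1]
                # Edges carry Euclidean inter-object distance in metres.
                edges.append((i, j, math.dist(pi, pj)))
    return nodes, edges

# Toy example: two detections at the principal point, 1 m and 3 m away.
nodes, edges = build_scene_graph([("chair", 320, 240, 1.0),
                                  ("table", 320, 240, 3.0)])
```

In the full system, nodes would additionally store open-vocabulary features and semantic context, and edges would carry VLM-predicted spatial and semantic relations alongside the distances.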