🤖 AI Summary
This work addresses the challenges of open-vocabulary 3D scene graph generation, which suffers from low object recognition accuracy and poor computational efficiency due to view occlusions and surface redundancy. The authors propose a re-shot-guided uncertainty estimation mechanism that suppresses noise during cross-view feature aggregation. Retrieval-augmented generation (RAG), conditioned on reliable low-uncertainty objects, further improves semantic accuracy. Additionally, a dynamic downsample-mapping strategy accelerates cross-image object aggregation. Evaluated on the Replica dataset, the approach significantly improves node description accuracy while reducing mapping time by roughly two-thirds, yielding more precise and efficient 3D scene graph construction.
📝 Abstract
Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing methods for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and slow generation speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG, which mitigates aggregation noise through re-shot-guided uncertainty estimation and supports object-level Retrieval-Augmented Generation (RAG) conditioned on reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing mapping time by two-thirds compared to the vanilla version.