AI Summary
Existing approaches to 3D scene graph construction rely on depth sensors, and achieving scalable, dense, and semantically rich scene understanding from monocular RGB images alone remains challenging. This work proposes the first system that constructs open-vocabulary, dense, and scalable 3D scene graphs using only monocular RGB input. The method leverages open-vocabulary foundation models for room-level semantic segmentation, performs feed-forward dense reconstruction once a room is fully observed, and globally aligns the local maps through room-level factor graph optimization, while also supporting open-vocabulary object segmentation and tracking. Experiments show that the proposed approach outperforms existing feed-forward SLAM methods in trajectory estimation and dense reconstruction accuracy on both Habitat-Matterport 3D and a newly collected office dataset, while achieving competitive performance in open-vocabulary segmentation.
Abstract
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction until each room is fully observed, which enables scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from Habitat-Matterport 3D and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.
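The room-first control flow described in the abstract can be illustrated with a small sketch. This is not the LEXI-SG implementation: the names `RoomNode`, `ObjectNode`, `SceneGraph`, `build_scene_graph`, and the `reconstruct_room` callback are all hypothetical, per-frame room labels are assumed to be given (in the paper they would come from an open-vocabulary foundation model), and a room is treated as "fully observed" as soon as the label changes, which is a simplifying assumption. Global alignment through the room-based factor graph is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ObjectNode:
    label: str  # open-vocabulary label, e.g. "office chair" (hypothetical)


@dataclass
class RoomNode:
    label: str
    frame_ids: List[int] = field(default_factory=list)
    objects: List[ObjectNode] = field(default_factory=list)
    reconstructed: bool = False


@dataclass
class SceneGraph:
    rooms: List[RoomNode] = field(default_factory=list)


def build_scene_graph(
    frame_room_labels: List[str],
    reconstruct_room: Callable[[RoomNode], None],
) -> SceneGraph:
    """Buffer frames per room and reconstruct a room only once it is closed.

    Assumption: a change in room label means the previous room is fully
    observed. Revisiting a room creates a new node in this toy version.
    """
    graph = SceneGraph()
    current = None
    for frame_id, label in enumerate(frame_room_labels):
        if current is None or label != current.label:
            if current is not None:
                # Deferred reconstruction: runs once over all buffered frames.
                reconstruct_room(current)
                current.reconstructed = True
            current = RoomNode(label=label)
            graph.rooms.append(current)
        current.frame_ids.append(frame_id)
    if current is not None:
        # Close the last room at the end of the sequence.
        reconstruct_room(current)
        current.reconstructed = True
    return graph


if __name__ == "__main__":
    labels = ["kitchen", "kitchen", "hallway", "office", "office"]
    graph = build_scene_graph(labels, lambda room: None)
    for room in graph.rooms:
        print(room.label, room.frame_ids, room.reconstructed)
```

The point of the sketch is the ordering: frames are only grouped while a room is being traversed, and the expensive reconstruction step runs once per room rather than over a sliding window of frames.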