RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for 3D scene graph generation rely on dedicated depth sensors, which limits their deployment in RGB-only settings, and they lack semantics-driven active exploration mechanisms. This work proposes an RGB-only framework for active, incremental 3D scene graph construction that unifies perception and planning around a shared representation of object semantics, geometry, relational context, and multi-view information, while supporting joint mapping by mobile robots and fixed external cameras. Key contributions include a semantics-guided viewpoint selection strategy and a mechanism for fusing onboard and external RGB cameras into the same representation. On Replica, the approach achieves F1-score parity with baselines that use ground-truth depth; on ReplicaCAD, semantics-driven exploration detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget; and external static cameras help bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

📝 Abstract

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.
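
The abstract reports F1 scores but does not describe how predicted objects are matched to ground truth. As a minimal sketch of the metric alone, assuming the matching step has already produced counts of true positives, false positives, and false negatives (the function name and the zero-division convention are illustrative choices, not taken from the paper):

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined (an assumed convention).
    """
    predicted = true_positives + false_positives   # objects the pipeline reported
    actual = true_positives + false_negatives      # objects present in ground truth
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a pipeline that finds 8 of 10 ground-truth objects while reporting 2 spurious ones has precision and recall both equal to 0.8, so its F1 score is also 0.8.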

Problem

Research questions and friction points this paper is trying to address.

3D scene graph

RGB-only

active exploration

mobile robots

viewpoint selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-only

active exploration

3D scene graph