🤖 AI Summary
Existing functional 3D scene graphs struggle to model densely arranged small objects on tabletops and their multi-level functional relationships, leading to instance ambiguity, visually unanchored relation reasoning, and attribution uncertainty under dynamic viewpoints. This work proposes a joint framework that integrates open-vocabulary 2D visual grounding with 3D graph optimization and explicitly models dense tabletop objects and hierarchical functional relations. The method combines cross-frame multi-cue node association, temporal graph optimization based on evidence accumulation, entropy regularization, and temporal smoothing, and global hierarchical structure recovery, fusing fine-grained 2D visual evidence with 3D temporal information. Evaluated in complex real-world indoor environments, it improves the inference of functional relationships among small-scale, densely packed, and visually similar objects, enabling robust, hierarchically structured functional 3D scene graph construction.
📝 Abstract
Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored because existing benchmarks have limited coverage and previous pipelines are overly simple: they focus primarily on large-scale furniture and lack hierarchical structure. In this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges arising from small-scale, dense, and similar instances: a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges in 2D visual evidence and associate nodes across frames in 3D using multiple cues. Furthermore, we formulate edge association as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, we perform global hierarchy shaping to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, further unlocking their potential for practical applications.
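To make the edge-association step more concrete, the following is a minimal sketch of how evidence accumulation, entropy regularization, and temporal smoothing could fit together for a single node. It is an illustration under stated assumptions, not the paper's formulation: the function name `associate_edge`, the parameters `tau` and `alpha`, and the per-frame score format are all hypothetical. The sketch uses the standard fact that maximizing expected score plus `tau` times the entropy of the distribution has the closed-form solution `softmax(scores / tau)`, and it uses an exponential moving average as one possible form of temporal smoothing.

```python
import math


def softmax(scores, tau=1.0):
    """Entropy-regularized argmax: maximizes <p, scores> + tau * H(p)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def associate_edge(frame_scores, tau=0.5, alpha=0.6):
    """Pick the functional target of one node from per-frame evidence.

    frame_scores: one list per frame, holding a score for each candidate
    target node (hypothetical format, e.g. from 2D grounding).
    tau: entropy-regularization weight (assumed hyperparameter).
    alpha: smoothing factor of the moving average (assumed hyperparameter).
    """
    n = len(frame_scores[0])
    accumulated = [0.0] * n      # evidence accumulated over frames
    belief = [1.0 / n] * n       # start from a uniform belief
    for scores in frame_scores:
        accumulated = [a + s for a, s in zip(accumulated, scores)]
        target = softmax(accumulated, tau)
        # Temporal smoothing: blend the previous belief with the new estimate.
        belief = [alpha * b + (1 - alpha) * t for b, t in zip(belief, target)]
    best = max(range(n), key=lambda i: belief[i])
    return best, belief, entropy(belief)


# Three frames, three candidate targets; candidate 1 is consistently favored.
frame_scores = [[0.2, 0.9, 0.1], [0.3, 0.8, 0.2], [0.1, 0.7, 0.6]]
best, belief, h = associate_edge(frame_scores)
```

In this sketch a low final entropy indicates a confident assignment, so a downstream step could defer committing an edge while the entropy stays high. Whether the actual pipeline uses entropy this way is not stated in the abstract.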