AI Summary
This work addresses the rigid registration challenge of semantic scene graphs in multi-agent and cross-temporal scenarios, where conventional handcrafted descriptors suffer from poor generalizability and existing learning-based methods rely heavily on ground-truth annotations. To overcome these limitations, we propose a novel annotation-free data generation paradigm leveraging vision foundation models (VFMs) to reconstruct semantic scene graphs. We design a compact node representation integrating open-vocabulary semantics, spatial topology, and geometric shape priors. Furthermore, we introduce a multimodal graph neural network architecture that jointly performs coarse-to-fine matching, robust pose estimation, and sparse hierarchical scene representation. Evaluated on a dual-agent SLAM benchmark, our method achieves significantly higher registration success rates than handcrafted feature-based approaches and marginally outperforms visual loop-closure networks in recall. With only 52 KB per-frame communication bandwidth, it demonstrates superior efficiency, generalizability, and practical applicability.
Abstract
This paper addresses the challenge of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against that of a remote agent, or against a prior map. The hand-crafted descriptors of classical semantic-aided registration and the reliance on ground-truth annotations in learning-based scene graph registration both impede application in practical real-world environments. To address these challenges, we design a scene graph network that encodes multiple modalities of each semantic node: an open-set semantic feature, a local topology with spatial awareness, and a shape feature. These modalities are fused into compact semantic node features. Matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, a robust pose estimator computes the transformation from the correspondences. We maintain a sparse, hierarchical scene representation that demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach that uses vision foundation models and a semantic mapping module to reconstruct semantic scene graphs; it differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method on a two-agent SLAM benchmark, where it significantly outperforms the hand-crafted baseline in registration success rate. Compared to visual loop-closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth per query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg
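The back-end step described above, recovering a rigid transformation from node correspondences, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name and the hypothetical N x 3 arrays of matched node centroids are assumptions. It uses the standard SVD-based (Kabsch) least-squares solution, which a robust estimator would typically wrap in an outlier-rejection loop such as RANSAC.

```python
# Illustrative sketch (assumed, not SG-Reg's actual code): least-squares rigid
# alignment of matched semantic-node centroids via the Kabsch/SVD method.
import numpy as np

def estimate_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Solve dst ~= R @ src + t for rotation R and translation t.

    src, dst: (N, 3) arrays of corresponding node centroids.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known rotation about z and a translation.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
src = np.random.default_rng(0).random((8, 3))
dst = src @ R_true.T + t_true
R, t = estimate_rigid_transform(src, dst)
```

In practice, a robust estimator would repeatedly solve this closed-form problem on sampled correspondence subsets and keep the transformation with the largest inlier set.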