OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps in open-world 3D scene graph alignment: vision-language features are underused, and no large-scale dataset is available. It covers both the frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks. The authors propose a unified and efficient alignment framework that fuses vision-language, textual, and geometric features with spatial context, using a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to match objects accurately. Key contributions include a systematic treatment of the underexplored F2S alignment task, the use of open-set vision-language representations, and ScanNet-SG, a large-scale dataset with over 700,000 samples that covers 509 object categories from ScanNet labels and over 3,000 categories from GPT-4o-based tagging. Experiments show that the method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. The code and dataset are publicly released.
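The summary names a distance-gated spatial attention encoder without giving its formulation. As a rough illustration of the general idea only, the sketch below damps ordinary dot-product attention between object nodes with a Gaussian gate on their pairwise distance. The function name `distance_gated_attention`, the parameter `sigma`, and the Gaussian form of the gate are all assumptions made for this example, not the authors' design.

```python
import math


def distance_gated_attention(features, positions, sigma=1.0):
    """Toy single-head self-attention with a Gaussian distance gate.

    features:  list of per-object feature vectors (equal length)
    positions: list of per-object 3D centroids
    sigma:     hypothetical gate width; larger values let distant objects interact
    """
    dim = len(features[0])
    outputs = []
    for f_i, p_i in zip(features, positions):
        logits = []
        for f_j, p_j in zip(features, positions):
            # Scaled dot-product similarity between the two feature vectors.
            dot = sum(a * b for a, b in zip(f_i, f_j)) / math.sqrt(dim)
            # Squared Euclidean distance between the two centroids.
            dist2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
            # Adding the log of a Gaussian gate to the logit is equivalent to
            # multiplying the unnormalised attention weight by the gate.
            logits.append(dot - dist2 / (2.0 * sigma ** 2))
        # Numerically stable softmax over neighbours.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of neighbour features.
        outputs.append(
            [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    cents = [[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [5.0, 0.0, 0.0]]
    for row in distance_gated_attention(feats, cents, sigma=1.0):
        print([round(v, 3) for v in row])
```

Because the gate depends only on pairwise distances, it is unchanged by a rigid rotation or translation of the whole scene. Whether the paper's encoder relies on this property is not stated in the summary.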

📝 Abstract

Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
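To make the minimum-cost-flow-based allocation step more concrete, here is a minimal sketch of one-to-one object matching posed as a min-cost flow problem and solved with successive shortest paths. It is not the authors' implementation. The function name `min_cost_flow_match`, the per-object bypass edge with `unmatched_cost`, and the use of `1 - similarity` as the edge cost are assumptions chosen for illustration. The bypass edge lets an object stay unmatched, which seems natural when the two graphs overlap only partially.

```python
def min_cost_flow_match(cost, unmatched_cost):
    """Match query objects to reference objects with unit-capacity min-cost flow.

    cost[i][j]:     non-negative cost of pairing query i with reference j
    unmatched_cost: hypothetical penalty for leaving a query object unmatched
    Returns a dict {query_index: reference_index} for matched pairs only.
    """
    n, m = len(cost), len(cost[0])
    source, sink = n + m, n + m + 1
    size = n + m + 2
    # Each edge is [target, residual capacity, cost, index of reverse edge].
    graph = [[] for _ in range(size)]

    def add_edge(u, v, cap, c):
        graph[u].append([v, cap, c, len(graph[v])])
        graph[v].append([u, 0, -c, len(graph[u]) - 1])

    for i in range(n):
        add_edge(source, i, 1, 0.0)
        add_edge(i, sink, 1, unmatched_cost)  # bypass: query i stays unmatched
        for j in range(m):
            add_edge(i, n + j, 1, cost[i][j])
    for j in range(m):
        add_edge(n + j, sink, 1, 0.0)  # each reference is used at most once

    inf = float("inf")
    while True:
        # Bellman-Ford handles the negative costs on residual (reverse) edges.
        dist = [inf] * size
        prev = [None] * size
        dist[source] = 0.0
        for _ in range(size - 1):
            updated = False
            for u in range(size):
                if dist[u] == inf:
                    continue
                for idx, (v, cap, c, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + c < dist[v] - 1e-12:
                        dist[v] = dist[u] + c
                        prev[v] = (u, idx)
                        updated = True
            if not updated:
                break
        if prev[sink] is None:
            break  # no augmenting path left: every query has been routed
        # Push one unit of flow along the cheapest augmenting path.
        v = sink
        while v != source:
            u, idx = prev[v]
            graph[u][idx][1] -= 1
            graph[v][graph[u][idx][3]][1] += 1
            v = u

    matches = {}
    for i in range(n):
        for v, cap, _, _ in graph[i]:
            if n <= v < n + m and cap == 0:  # saturated query-to-reference edge
                matches[i] = v - n
    return matches


if __name__ == "__main__":
    # Made-up similarity scores between 2 query and 3 reference objects.
    similarity = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.1]]
    cost = [[1.0 - s for s in row] for row in similarity]
    print(min_cost_flow_match(cost, unmatched_cost=0.5))
```

The Bellman-Ford inner loop keeps the sketch dependency-free but scales poorly. For graphs with many objects, a dedicated min-cost flow solver would be the more practical choice.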

Problem

Research questions and friction points this paper is trying to address.

3D scene graph alignment

frame-to-scan alignment

open-set vision-language features

large-scale dataset

object correspondence

Innovation

Methods, ideas, or system contributions that make the work stand out.

scene graph alignment

vision-language features

open-world 3D perception