OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

📅 2026-05-11

🤖 AI Summary
This work addresses two gaps in open-world 3D scene graph alignment: the limited use of vision-language features and the lack of large-scale datasets, for both frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks. The authors propose a unified and efficient alignment framework that fuses vision-language, textual, and geometric features with spatial context. Its modules include a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator, which together aim for accurate object matching even under large coordinate discrepancies. Key contributions include a treatment of the underexplored F2S alignment task, the use of open-set vision-language representations, and ScanNet-SG, a large-scale dataset with over 700k samples that covers 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that the method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Code and dataset are publicly released.
📝 Abstract
Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
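The abstract names a minimum-cost-flow-based allocator but does not give its formulation. As a rough illustration of how object correspondences can be posed as a minimum-cost flow problem, the sketch below matches query objects to reference objects and lets a query stay unmatched at a fixed price. Everything here is an assumption for illustration only: the function name `min_cost_flow_match`, the `unmatched_cost` parameter, the graph layout, and the toy cost values are hypothetical and are not taken from the paper or its released code.

```python
from collections import deque

def min_cost_flow_match(cost, unmatched_cost):
    """Match query objects (rows) to reference objects (columns) at minimum total cost.

    cost[i][j] is a dissimilarity between query i and reference j (assumed given).
    A query may stay unmatched at price `unmatched_cost` (hypothetical parameter).
    Returns a list where entry i is the matched column index, or None.
    """
    n = len(cost)
    m = len(cost[0]) if cost else 0
    source, sink = n + m, n + m + 1
    graph = [[] for _ in range(n + m + 2)]  # adjacency lists of edge indices
    edges = []  # [target, residual capacity, cost]; edge k ^ 1 is the reverse of k

    def add_edge(u, v, c):
        graph[u].append(len(edges))
        edges.append([v, 1, c])
        graph[v].append(len(edges))
        edges.append([u, 0, -c])

    for i in range(n):
        add_edge(source, i, 0.0)          # each query sends one unit of flow
        add_edge(i, sink, unmatched_cost)  # "no match" option
        for j in range(m):
            add_edge(i, n + j, cost[i][j])
    for j in range(m):
        add_edge(n + j, sink, 0.0)        # each reference accepts at most one query

    # Successive shortest paths: push one unit per query along the cheapest path.
    for _ in range(n):
        dist = [float("inf")] * len(graph)
        prev_edge = [-1] * len(graph)
        in_queue = [False] * len(graph)
        dist[source] = 0.0
        queue = deque([source])
        while queue:  # Bellman-Ford with a queue, handles negative residual costs
            u = queue.popleft()
            in_queue[u] = False
            for k in graph[u]:
                v, cap, c = edges[k]
                if cap > 0 and dist[u] + c < dist[v] - 1e-12:
                    dist[v] = dist[u] + c
                    prev_edge[v] = k
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        v = sink
        while v != source:  # augment along the path found
            k = prev_edge[v]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            v = edges[k ^ 1][0]

    # Read the assignment from saturated query -> reference edges.
    match = [None] * n
    for i in range(n):
        for k in graph[i]:
            v, cap, _ = edges[k]
            if k % 2 == 0 and n <= v < n + m and cap == 0:
                match[i] = v - n
    return match

# Toy example: three query objects, two reference objects.
toy_cost = [[0.1, 0.9], [0.8, 0.2], [0.9, 0.95]]
print(min_cost_flow_match(toy_cost, unmatched_cost=0.5))
```

In this toy setup the third query has no cheap counterpart, so leaving it unmatched is cheaper than forcing a pairing. Unlike a row-by-row greedy choice, the flow formulation optimizes the total cost over all pairs at once, which is one plausible reason to prefer it for partially overlapping observations.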
Problem

Research questions and friction points this paper is trying to address.

3D scene graph alignment
frame-to-scan alignment
open-set vision-language features
large-scale dataset
object correspondence
Innovation

Methods, ideas, or system contributions that make the work stand out.

scene graph alignment
vision-language features
open-world 3D perception
minimum-cost flow allocation
large-scale dataset