🤖 AI Summary
To address the fragility of 3D scene graph alignment under incomplete, noisy, and low-overlap point clouds, this paper proposes SGAligner++, a cross-modal, language-aided joint alignment method. The core innovation is a unified multi-modal joint embedding space, built from lightweight unimodal encoders and an attention-driven point cloud–language fusion mechanism, which enables accurate alignment of partially overlapping scenes across heterogeneous modalities. This design strengthens cross-modal generalization: on noisy real-world reconstructions, alignment accuracy improves by up to 40% over state-of-the-art methods. The approach also remains robust and scalable in downstream tasks such as visual localization and 3D reconstruction, even under incomplete observations, sensor noise, and sparse overlap.
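As a rough illustration of the attention-driven point cloud–language fusion described above, the following PyTorch sketch projects per-object geometric features and caption token features into a common width, then lets the point features attend to the language features via cross-attention before L2-normalizing into a joint embedding. The module layout, layer widths, and feature dimensions are assumptions chosen for illustration, not SGAligner++'s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbeddingFusion(nn.Module):
    """Cross-attention fusion of per-object point cloud and language
    features into a shared embedding space. Layer widths and the module
    layout are illustrative assumptions, not the paper's architecture."""

    def __init__(self, point_dim=256, text_dim=384, joint_dim=128, num_heads=4):
        super().__init__()
        # Lightweight unimodal projections into a common width.
        self.point_proj = nn.Linear(point_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Point features (queries) attend to language tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(joint_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(joint_dim)

    def forward(self, point_feats, text_feats):
        # point_feats: (B, N, point_dim) per-object geometric features
        # text_feats:  (B, M, text_dim)  token-level caption features
        q = self.point_proj(point_feats)
        kv = self.text_proj(text_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        # Residual + norm, then L2-normalize for cosine-similarity matching.
        joint = self.norm(q + fused)
        return F.normalize(joint, dim=-1)


if __name__ == "__main__":
    fusion = JointEmbeddingFusion()
    pts = torch.randn(2, 12, 256)  # 2 scenes, 12 objects each
    txt = torch.randn(2, 20, 384)  # 20 caption tokens per scene
    print(fusion(pts, txt).shape)  # torch.Size([2, 12, 128])
```

Because the fused embeddings are L2-normalized, similarity between objects from different scenes reduces to a dot product, which is what makes downstream matching cheap.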
📝 Abstract
Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
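To make the joint-embedding alignment concrete, here is a minimal sketch of how node correspondences between two scene graphs could be read out of such a space: mutual nearest neighbors under cosine similarity, with a confidence threshold so nodes outside the overlapping region stay unmatched. The matching criterion and the threshold value are assumptions for illustration; the paper's actual alignment procedure may differ.

```python
import torch


def match_nodes(emb_a, emb_b, sim_threshold=0.5):
    """Mutual-nearest-neighbor matching of scene graph nodes in a shared
    embedding space. Generic illustration; the threshold and criterion
    are assumptions, not SGAligner++'s actual alignment procedure.

    emb_a: (Na, D) L2-normalized node embeddings of scene graph A
    emb_b: (Nb, D) L2-normalized node embeddings of scene graph B
    Returns a list of (i, j) pairs of matched node indices.
    """
    sim = emb_a @ emb_b.T          # cosine similarities, shape (Na, Nb)
    best_b = sim.argmax(dim=1)     # nearest neighbor in B for each node in A
    best_a = sim.argmax(dim=0)     # nearest neighbor in A for each node in B
    matches = []
    for i, j in enumerate(best_b.tolist()):
        # Keep a pair only if it is mutually nearest and confident enough;
        # nodes in non-overlapping regions are left unmatched.
        if best_a[j].item() == i and sim[i, j].item() > sim_threshold:
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
    b = torch.nn.functional.normalize(torch.randn(7, 128), dim=-1)
    print(match_nodes(a, b, sim_threshold=0.0))
```

Mutual nearest neighbors with a threshold is a deliberately conservative choice here: under low overlap, rejecting one-sided matches matters more than recovering every correspondence.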