🤖 AI Summary
To address the fragility of 3D scene graph alignment under incomplete, noisy, and low-overlap point clouds, this paper proposes SGAligner++, a cross-modal, language-aided joint alignment method. The core innovation is a unified multi-modal joint embedding space, built from lightweight unimodal encoders and an attention-driven point cloud–language fusion mechanism, which enables accurate alignment of partially overlapping scenes across heterogeneous modalities. This design strengthens cross-modal generalization: on noisy real-world reconstructions, alignment accuracy improves by up to 40% over state-of-the-art methods. The approach also remains robust and scalable in downstream tasks such as visual localization and 3D reconstruction, even under incomplete observations, sensor noise, and sparse overlap.
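As a rough illustration of the attention-driven point cloud–language fusion described above, the following PyTorch sketch projects per-object geometric features and caption token features into a common width, then lets the point features attend to the language features via cross-attention before L2-normalizing into a joint embedding. The module layout, layer widths, and feature dimensions are assumptions chosen for illustration, not SGAligner++'s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbeddingFusion(nn.Module):
    """Cross-attention fusion of per-object point cloud and language
    features into a shared embedding space. Layer widths and the module
    layout are illustrative assumptions, not the paper's architecture."""

    def __init__(self, point_dim=256, text_dim=384, joint_dim=128, num_heads=4):
        super().__init__()
        # Lightweight unimodal projections into a common width.
        self.point_proj = nn.Linear(point_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Point features (queries) attend to language tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(joint_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(joint_dim)

    def forward(self, point_feats, text_feats):
        # point_feats: (B, N, point_dim) per-object geometric features
        # text_feats:  (B, M, text_dim)  token-level caption features
        q = self.point_proj(point_feats)
        kv = self.text_proj(text_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        # Residual + norm, then L2-normalize for cosine-similarity matching.
        joint = self.norm(q + fused)
        return F.normalize(joint, dim=-1)


if __name__ == "__main__":
    fusion = JointEmbeddingFusion()
    pts = torch.randn(2, 12, 256)  # 2 scenes, 12 objects each
    txt = torch.randn(2, 20, 384)  # 20 caption tokens per scene
    print(fusion(pts, txt).shape)  # torch.Size([2, 12, 128])
```

Because the fused embeddings are L2-normalized, similarity between objects from different scenes reduces to a dot product, which is what makes downstream matching cheap.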
📝 Abstract
Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
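To make the joint-embedding alignment concrete, here is a minimal sketch of how node correspondences between two scene graphs could be read out of such a space: mutual nearest neighbors under cosine similarity, with a confidence threshold so nodes outside the overlapping region stay unmatched. The matching criterion and the threshold value are assumptions for illustration; the paper's actual alignment procedure may differ.

```python
import torch


def match_nodes(emb_a, emb_b, sim_threshold=0.5):
    """Mutual-nearest-neighbor matching of scene graph nodes in a shared
    embedding space. Generic illustration; the threshold and criterion
    are assumptions, not SGAligner++'s actual alignment procedure.

    emb_a: (Na, D) L2-normalized node embeddings of scene graph A
    emb_b: (Nb, D) L2-normalized node embeddings of scene graph B
    Returns a list of (i, j) pairs of matched node indices.
    """
    sim = emb_a @ emb_b.T          # cosine similarities, shape (Na, Nb)
    best_b = sim.argmax(dim=1)     # nearest neighbor in B for each node in A
    best_a = sim.argmax(dim=0)     # nearest neighbor in A for each node in B
    matches = []
    for i, j in enumerate(best_b.tolist()):
        # Keep a pair only if it is mutually nearest and confident enough;
        # nodes in non-overlapping regions are left unmatched.
        if best_a[j].item() == i and sim[i, j].item() > sim_threshold:
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
    b = torch.nn.functional.normalize(torch.randn(7, 128), dim=-1)
    print(match_nodes(a, b, sim_threshold=0.0))
```

Mutual nearest neighbors with a threshold is a deliberately conservative choice here: under low overlap, rejecting one-sided matches matters more than recovering every correspondence.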