IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Indoor scene understanding for open-world robotic tasks suffers from poor instance-level modeling robustness and limited semantic generalization. To address this, we propose an efficient and robust method for constructing instance-level 3D scene graphs. Our approach introduces a novel room-prior-guided parallel instance fusion mechanism that tightly integrates LiDAR geometric priors with multi-level visual foundation models (CLIP, SAM, SA3D) for semantic understanding. It is the first to enable vision foundation model (VFM)-driven open-vocabulary recognition and language-queryable 3D scene graph construction, supporting end-to-end language-guided navigation. The framework incorporates LiDAR-camera cross-modal fusion, room-level geometric segmentation, and joint semantic-geometric optimization. Experiments demonstrate a 10× speedup in graph construction, state-of-the-art semantic accuracy, and consistent effectiveness in both simulation and real-world environments—particularly for language-instruction-driven navigation tasks.

📝 Abstract
Indoor scene understanding remains a fundamental challenge in robotics, with direct implications for downstream tasks such as navigation and manipulation. Traditional approaches often rely on closed-set recognition or loop closure, limiting their adaptability in open-world environments. With the advent of visual foundation models (VFMs), open-vocabulary recognition and natural language querying have become feasible, unlocking new possibilities for 3D scene graph construction. In this paper, we propose a robust and efficient framework for instance-level 3D scene graph construction via LiDAR-camera fusion. Leveraging LiDAR's wide field of view (FOV) and long-range sensing capabilities, we rapidly acquire room-level geometric priors. Multi-level VFMs are employed to improve the accuracy and consistency of semantic extraction. During instance fusion, room-based segmentation enables parallel processing, while the integration of geometric and semantic cues significantly enhances fusion accuracy and robustness. Compared to state-of-the-art methods, our approach achieves up to an order-of-magnitude improvement in construction speed while maintaining high semantic precision. Extensive experiments in both simulated and real-world environments validate the effectiveness of our approach. We further demonstrate its practical value through a language-guided semantic navigation task, highlighting its potential for real-world robotic applications.
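The room-based parallel fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Detection` data model, the distance and similarity thresholds, and the greedy merge rule are all assumptions made for the sketch; in the actual system the semantic embeddings would come from VFMs such as CLIP and the room partition from LiDAR geometric priors.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import math

# Hypothetical, simplified data model: each detection carries a room id
# (from the room-level geometric prior), a 3D centroid (geometric cue),
# and a unit-norm semantic embedding (e.g. a CLIP feature).
@dataclass
class Detection:
    room: str
    centroid: tuple   # (x, y, z)
    embedding: list   # unit-norm semantic feature

def _cos(a, b):
    # cosine similarity for unit-norm vectors
    return sum(x * y for x, y in zip(a, b))

def fuse_room(dets, dist_thresh=0.5, sim_thresh=0.8):
    """Greedy instance fusion inside one room: merge detections that are
    both geometrically close and semantically similar (illustrative only)."""
    instances = []
    for d in dets:
        for inst in instances:
            if (math.dist(d.centroid, inst["centroid"]) < dist_thresh
                    and _cos(d.embedding, inst["embedding"]) > sim_thresh):
                # running average of the centroid; keep the first embedding
                n = inst["count"]
                inst["centroid"] = tuple(
                    (c * n + x) / (n + 1)
                    for c, x in zip(inst["centroid"], d.centroid))
                inst["count"] = n + 1
                break
        else:
            instances.append({"centroid": d.centroid,
                              "embedding": d.embedding, "count": 1})
    return instances

def build_scene_graph(detections):
    """Partition detections by room prior, then fuse rooms in parallel.
    Rooms are independent, which is what makes parallel fusion possible."""
    rooms = {}
    for d in detections:
        rooms.setdefault(d.room, []).append(d)
    with ThreadPoolExecutor() as pool:
        fused = dict(zip(rooms, pool.map(fuse_room, rooms.values())))
    return fused  # room id -> list of fused instances
```

Because instances never span the room partition, each room can be fused independently, which is the structural property behind the reported order-of-magnitude speedup.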
Problem

Research questions and friction points this paper is trying to address.

Enhancing indoor scene understanding via LiDAR-camera fusion
Overcoming closed-set recognition limits in open-world environments
Enabling efficient 3D scene graph construction for robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LiDAR-camera fusion for 3D scene graphs
Room prior guided geometric segmentation
Multi-level visual foundation models
Hongming Chen
School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, China
Yiyang Lin
PHD at the Chinese University of Hong Kong, MEng at Tsinghua University
Ziliang Li
Sun Yat-sen University
Robotics & Aerial Manipulation
Biyu Ye
Sun Yat-sen University
Aerial Robotics & Electronic Engineering
Yuying Zhang
School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, China
Ximin Lyu
School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, China