Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Open-world 3D scene understanding is hindered by closed-vocabulary supervision and static annotations, limiting dynamic and long-tailed semantic reasoning. To address this, we propose the first unified framework integrating vision-language models (VLMs) with retrieval-augmented generation (RAG) for category-agnostic dynamic scene graph generation and multimodal interactive reasoning. Our key contributions are: (1) a 3D-aware open-vocabulary scene graph generation mechanism that grounds semantics directly in geometric and visual features; and (2) a vector-database-backed cross-modal RAG pipeline enabling language-guided object localization, relational inference, and task planning. Evaluated on 3DSSG and Replica, our method achieves significant improvements over state-of-the-art approaches across four tasks—scene question answering, visual grounding, instance retrieval, and task planning—demonstrating strong generalization and scalability to unseen categories and complex interactions.

Technology Category

Application Category

📝 Abstract
Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks-scene question answering, visual grounding, instance retrieval, and task planning-demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
Problem

Research questions and friction points this paper is trying to address.

Developing open-world 3D scene understanding beyond closed-vocabulary limitations
Creating dynamic scene graphs without fixed label sets for objects and relationships
Enabling multimodal exploration through retrieval-augmented reasoning for diverse queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Models with retrieval-based reasoning
Generates dynamic scene graphs without fixed label sets
Encodes scene graphs into vector database for queries
🔎 Similar Papers
No similar papers found.
F
Fei Yu
Liaoning University of Technology
Q
Quan Deng
University of the Chinese Academy of Sciences
S
Shengeng Tang
Hefei University of Technology
Y
Yuehua Li
Zhejiang Lab
Lechao Cheng
Lechao Cheng
Associate Professor, Hefei University of Technology
Imbalanced LearningDistillationNoisy Label LearningWeakly Supervised LearningVisual Tuning