Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing zero-shot large language model (LLM) approaches for 3D scene understanding suffer from low accuracy and poor efficiency. To address this, we propose a training-free hierarchical 3D scene understanding framework that operates solely on sparse RGB views. Our method constructs a plane-augmented scene graph, using dominant planar surfaces as spatial anchors, and introduces a task-adaptive dynamic subgraph extraction mechanism to enable open-vocabulary hierarchical parsing and robust reasoning. Technically, it integrates pre-trained LLMs with multi-view geometric awareness, spatial relation modeling, and dynamic context filtering. On Space3D-Bench, our method achieves a 28.7% improvement in EM@1 and a 78.2% inference speedup over the ConceptGraphs baseline; on ScanQA, it matches training-based methods, demonstrating strong generalization and robustness. Key innovations include a plane-guided hierarchical graph structure and a task-driven lightweight inference paradigm.
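To make the plane-augmented scene graph concrete, here is a minimal sketch of the kind of hierarchy the summary describes: dominant planar surfaces act as spatial anchors, and object nodes attach to them. All class and field names below are hypothetical illustrations; the paper does not publish this interface.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str        # open-vocabulary object label
    centroid: tuple   # 3D position estimated from sparse RGB views

@dataclass
class PlaneAnchor:
    name: str         # e.g. "floor", "north wall"
    normal: tuple     # unit normal of the dominant plane
    objects: list = field(default_factory=list)  # objects anchored to this plane

@dataclass
class SceneGraph:
    room: str
    planes: list = field(default_factory=list)

    def add_object(self, plane_name: str, obj: ObjectNode) -> None:
        """Attach an object to its supporting planar anchor."""
        for p in self.planes:
            if p.name == plane_name:
                p.objects.append(obj)
                return
        raise KeyError(plane_name)

# Build a toy graph: a floor plane anchoring two objects.
g = SceneGraph(room="office", planes=[PlaneAnchor("floor", (0.0, 0.0, 1.0))])
g.add_object("floor", ObjectNode("desk", (1.0, 2.0, 0.0)))
g.add_object("floor", ObjectNode("chair", (1.5, 2.0, 0.0)))
print(len(g.planes[0].objects))  # → 2
```

Anchoring objects to planes like this gives the LLM a short, structured reasoning chain (room, plane, object) instead of a flat object list.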

📝 Abstract
Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D scene understanding accuracy and efficiency without training
Leveraging sparse RGB views for hierarchical scene parsing and reasoning
Reducing contextual noise in open-ended 3D scene interpretation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using pre-trained LLMs
Hierarchical plane-enhanced scene graph with spatial anchors
Task-adaptive subgraph extraction reducing contextual noise
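The task-adaptive subgraph extraction idea can be sketched as follows: keep only graph nodes relevant to the query (plus their one-hop neighbors, so spatial relations remain interpretable), shrinking the context handed to the LLM. The simple keyword-overlap relevance rule used here is an assumption for illustration; the paper's actual scoring is more involved.

```python
def extract_subgraph(nodes, edges, query):
    """Filter a scene graph down to query-relevant content.

    nodes: {node_id: label}; edges: [(src, relation, dst)]; query: free-form text.
    Returns the retained nodes and the edges among them.
    """
    q = set(query.lower().split())
    # Keep nodes whose label shares a word with the query (illustrative rule).
    keep = {i for i, label in nodes.items() if q & set(label.lower().split())}
    # Also keep one-hop neighbors so spatial relations stay usable.
    hop = ({d for s, _, d in edges if s in keep}
           | {s for s, _, d in edges if d in keep})
    keep |= hop
    sub_edges = [(s, r, d) for s, r, d in edges if s in keep and d in keep]
    return {i: nodes[i] for i in keep}, sub_edges

nodes = {0: "wooden desk", 1: "office chair", 2: "ceiling lamp"}
edges = [(1, "next to", 0), (2, "above", 0)]
sub_nodes, sub_edges = extract_subgraph(nodes, edges, "where is the chair")
print(sorted(sub_nodes))  # → [0, 1]
```

Dropping the irrelevant lamp node both shortens the LLM prompt and removes a potential distractor, which is the mechanism behind the reported accuracy and speed gains.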
Haida Feng
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Hao Wei
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Zewen Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Haolin Wang
Ph.D. Student, Georgia Institute of Technology.
Chade Li
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Yihong Wu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.