Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing zero-shot large language model (LLM) approaches for 3D scene understanding suffer from low accuracy and poor efficiency. To address this, we propose a training-free hierarchical 3D scene understanding framework that operates solely on sparse RGB views. Our method constructs a plane-augmented scene graph, using dominant planar surfaces as spatial anchors, and introduces a task-adaptive dynamic subgraph extraction mechanism to enable open-vocabulary hierarchical parsing and robust reasoning. Technically, it integrates pre-trained LLMs with multi-view geometric awareness, spatial relation modeling, and dynamic context filtering. On Space3D-Bench, our method achieves a 28.7% improvement in EM@1 and a 78.2% inference speedup over the ConceptGraphs baseline; on ScanQA, it matches training-based methods, demonstrating strong generalization and robustness. Key innovations include a plane-guided hierarchical graph structure and a task-driven lightweight inference paradigm.
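To make the plane-augmented scene graph concrete, here is a minimal sketch of the kind of hierarchy the summary describes: dominant planar surfaces act as spatial anchors, and object nodes attach to them. All class and field names below are hypothetical illustrations; the paper does not publish this interface.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str        # open-vocabulary object label
    centroid: tuple   # 3D position estimated from sparse RGB views

@dataclass
class PlaneAnchor:
    name: str         # e.g. "floor", "north wall"
    normal: tuple     # unit normal of the dominant plane
    objects: list = field(default_factory=list)  # objects anchored to this plane

@dataclass
class SceneGraph:
    room: str
    planes: list = field(default_factory=list)

    def add_object(self, plane_name: str, obj: ObjectNode) -> None:
        """Attach an object to its supporting planar anchor."""
        for p in self.planes:
            if p.name == plane_name:
                p.objects.append(obj)
                return
        raise KeyError(plane_name)

# Build a toy graph: a floor plane anchoring two objects.
g = SceneGraph(room="office", planes=[PlaneAnchor("floor", (0.0, 0.0, 1.0))])
g.add_object("floor", ObjectNode("desk", (1.0, 2.0, 0.0)))
g.add_object("floor", ObjectNode("chair", (1.5, 2.0, 0.0)))
print(len(g.planes[0].objects))  # → 2
```

Anchoring objects to planes like this gives the LLM a short, structured reasoning chain (room, plane, object) instead of a flat object list.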

📝 Abstract
Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D scene understanding accuracy and efficiency without training
Leveraging sparse RGB views for hierarchical scene parsing and reasoning
Reducing contextual noise in open-ended 3D scene interpretation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using pre-trained LLMs
Hierarchical plane-enhanced scene graph with spatial anchors
Task-adaptive subgraph extraction reducing contextual noise
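The task-adaptive subgraph extraction idea can be sketched as follows: keep only graph nodes relevant to the query (plus their one-hop neighbors, so spatial relations remain interpretable), shrinking the context handed to the LLM. The simple keyword-overlap relevance rule used here is an assumption for illustration; the paper's actual scoring is more involved.

```python
def extract_subgraph(nodes, edges, query):
    """Filter a scene graph down to query-relevant content.

    nodes: {node_id: label}; edges: [(src, relation, dst)]; query: free-form text.
    Returns the retained nodes and the edges among them.
    """
    q = set(query.lower().split())
    # Keep nodes whose label shares a word with the query (illustrative rule).
    keep = {i for i, label in nodes.items() if q & set(label.lower().split())}
    # Also keep one-hop neighbors so spatial relations stay usable.
    hop = ({d for s, _, d in edges if s in keep}
           | {s for s, _, d in edges if d in keep})
    keep |= hop
    sub_edges = [(s, r, d) for s, r, d in edges if s in keep and d in keep]
    return {i: nodes[i] for i in keep}, sub_edges

nodes = {0: "wooden desk", 1: "office chair", 2: "ceiling lamp"}
edges = [(1, "next to", 0), (2, "above", 0)]
sub_nodes, sub_edges = extract_subgraph(nodes, edges, "where is the chair")
print(sorted(sub_nodes))  # → [0, 1]
```

Dropping the irrelevant lamp node both shortens the LLM prompt and removes a potential distractor, which is the mechanism behind the reported accuracy and speed gains.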
Haida Feng
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Hao Wei
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Zewen Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Haolin Wang
Ph.D. Student, Georgia Institute of Technology.
Chade Li
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Yihong Wu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.