OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary 3D scene understanding (OV-3D) methods focus solely on object category recognition and fail to evaluate how well models comprehend fine-grained semantic attributes. Method: We propose generalized open-vocabulary 3D scene understanding (GOV-3D), a task that extends open-vocabulary reasoning to eight linguistic dimensions (function, material, color, shape, size, orientation, spatial relation, and state), encompassing abstract, object-specific, and fine-grained attribute queries. To support this, we introduce OpenScan, the first 3D benchmark for attribute-level language understanding, built upon real-world ScanNet and Structured3D scans and annotated with a cross-dimensional attribute taxonomy. Contribution/Results: Extensive experiments reveal that state-of-the-art OV-3D models achieve under 32% average accuracy on GOV-3D, exposing critical deficiencies in semantic disentanglement and attribute-aware representation learning, thereby establishing a new evaluation standard and identifying key directions for future research.

📝 Abstract
Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond a closed set of object classes. However, existing approaches and benchmarks primarily address the open-vocabulary problem within the context of object classes, which is insufficient for holistically evaluating the extent to which a model understands a 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open-vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which covers 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on OpenScan and find that they struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up the number of object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.
Problem

Research questions and friction points this paper is trying to address.

Extends 3D scene understanding beyond predefined object classes.
Introduces generalized open-vocabulary 3D scene understanding (GOV-3D).
Evaluates models on fine-grained, object-specific attributes via OpenScan.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Generalized Open-Vocabulary 3D Scene Understanding
Develops OpenScan benchmark with diverse 3D attributes
Evaluates state-of-the-art methods on abstract vocabularies
Youjun Zhao
City University of Hong Kong
Computer Vision · Machine Learning
Jiaying Lin
Peking University
Computer Vision · Multimodal
Shuquan Ye
City University of Hong Kong
Qianshi Pang
South China University of Technology
Rynson W. H. Lau
City University of Hong Kong