🤖 AI Summary
Existing evaluations of 3D open-vocabulary scene representations rely on closed-set semantics that fail to capture linguistic diversity and real-world scene complexity. To address this, the authors propose OpenLex3D, a dedicated benchmark for open-vocabulary 3D scene understanding that provides entirely new label annotations for 23 real-world scenes drawn from three existing datasets (Replica, ScanNet++, and HM3D). The annotations introduce synonym-rich object categories and additional nuanced natural language descriptions, and the benchmark defines two tasks: open-set 3D semantic segmentation and object retrieval. Benchmarking a range of state-of-the-art methods exposes limitations in feature precision, segmentation consistency, and downstream generalization. OpenLex3D is publicly released to support standardized, rigorous evaluation of 3D vision-language understanding.
📝 Abstract
3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymous object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights into feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases and avenues for improvement. The benchmark is publicly available at: https://openlex3d.github.io/.
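To make the synonym-aware evaluation idea concrete, below is a minimal sketch (not the benchmark's actual scoring code) of how open-vocabulary predictions might be matched against synonym-rich ground truth: each object's feature vector is compared against CLIP-style text embeddings of candidate prompts, and a prediction counts as correct if its top-scoring prompt falls anywhere in the ground-truth synonym set. All names here (`encode_text`, `label_objects`, `synonym_accuracy`) are hypothetical illustrations, not OpenLex3D APIs.

```python
# Minimal sketch of synonym-aware open-vocabulary labeling.
# Assumptions (not from the OpenLex3D paper): object features are
# CLIP-style embeddings, and a prediction is correct when its best
# matching prompt belongs to the ground-truth synonym set.
import numpy as np

def encode_text(prompts):
    # Hypothetical stand-in for a real CLIP text encoder: returns
    # unit-norm random vectors so the sketch is self-contained.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(len(prompts), 512))
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def label_objects(object_features, prompts):
    """Assign each object the prompt with the highest cosine similarity."""
    text_feats = encode_text(prompts)
    obj = object_features / np.linalg.norm(object_features, axis=1, keepdims=True)
    sims = obj @ text_feats.T            # (num_objects, num_prompts)
    return sims.argmax(axis=1)           # best prompt index per object

def synonym_accuracy(pred_idx, prompts, gt_synonym_sets):
    """Count a prediction correct if it lands anywhere in the synonym set."""
    hits = [prompts[p] in syns for p, syns in zip(pred_idx, gt_synonym_sets)]
    return float(np.mean(hits))
```

Scoring against a set of acceptable labels, rather than a single canonical one, is what distinguishes this style of evaluation from conventional closed-set segmentation metrics; the paper's actual protocol and label categories are defined on the project page linked above.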