🤖 AI Summary
Existing language-guided 3D Gaussian Splatting (LGS) methods are evaluated on a small number of scenes and on 2D renderings near the training views, which fails to reflect genuine 3D understanding. To address this, we introduce GaussianWorld, the first large-scale, language-aware 3D benchmark for LGS evaluation, covering 1,060 diverse indoor and outdoor scenes. We propose an evaluation paradigm that assesses language and Gaussian splatting jointly in 3D space, and we systematically evaluate three method categories: per-scene optimization-based, per-scene optimization-free, and generalizable approaches. We also introduce GaussianWorld-49K, a high-quality 3DGS dataset of around 49K scenes with rich language annotations. Our evaluation demonstrates that generalizable LGS methods achieve zero-shot cross-scene inference and significantly outperform prior methods on 3D semantic segmentation. Furthermore, we identify a critical bottleneck in current LGS methods: degraded 3D comprehension under distant viewpoints. All code, data, and evaluation tools will be publicly released.
📝 Abstract
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting methods fall into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes, at viewpoints close to the training views, which limits insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1,060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our code, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.
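To make "evaluating directly in 3D space" concrete, the sketch below scores semantic segmentation on per-point (or per-Gaussian) labels rather than on rendered 2D pixels. It is an illustration only: the abstract does not specify the benchmark's metrics or data layout, so the choice of mean IoU, the function name `mean_iou_3d`, the `ignore_label` convention, and the toy labels are all assumptions.

```python
def mean_iou_3d(pred, gt, num_classes, ignore_label=-1):
    """Mean IoU over per-point (or per-Gaussian) class labels.

    Hypothetical helper, not the benchmark's actual evaluation code.
    Assumes ground-truth labels are in [0, num_classes) or equal ignore_label.
    """
    inter = [0] * num_classes  # points where prediction and ground truth agree on class c
    union = [0] * num_classes  # points where prediction or ground truth is class c
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue  # unannotated points do not count for or against any class
        if p == g:
            inter[g] += 1
            union[g] += 1
        else:
            union[g] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    # Average only over classes that appear in the prediction or the ground truth.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy scene: six 3D points, three classes, last point unannotated.
gt_labels = [0, 0, 1, 1, 2, -1]
pred_labels = [0, 1, 1, 1, 2, 0]
score = mean_iou_3d(pred_labels, gt_labels, num_classes=3)
print(f"3D mIoU: {score:.4f}")
```

Because the score is computed on the 3D primitives themselves, it does not depend on which camera viewpoints are rendered, which is the property a 2D-view evaluation lacks.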