🤖 AI Summary
To localize open-vocabulary text queries at the level of individual 3D Gaussians, this paper proposes a point-level querying method built on LangSplat's framework. It makes two changes: (1) it uses masklets from the Segment Anything Model 2 (SAM2) to construct semantically consistent ground truth for distilling the language Gaussians, and (2) it introduces a two-step query that first retrieves the distilled ground truth for the text query and then uses that ground truth to select individual Gaussians. Point-level querying itself was introduced earlier by OpenGaussian; the contribution here is the ground-truth construction and the two-step retrieval. Evaluated on three benchmark datasets, including 3D-OVS, the method outperforms state-of-the-art approaches, with an mIoU improvement of +20.42 on 3D-OVS.
📝 Abstract
Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat's framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantically consistent ground truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground truth and subsequently uses that ground truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art approaches. For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.
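The two-step querying idea in (b) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that the text query, each distilled ground-truth entry, and each Gaussian are represented by a feature vector in a shared embedding space, that matching uses cosine similarity, and that Gaussians are selected by a fixed threshold. The names `two_step_query`, `gt_feats`, `gaussian_feats`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def two_step_query(text_emb, gt_feats, gaussian_feats, threshold=0.9):
    """Hypothetical two-step query (an assumption, not the paper's code).

    Step 1: pick the distilled ground-truth feature closest to the text embedding.
    Step 2: select the Gaussians whose features are close to that ground-truth feature.
    Returns (index of the retrieved ground-truth entry, indices of selected Gaussians).
    """
    # Step 1: retrieve the ground-truth entry that best matches the text query.
    best = max(range(len(gt_feats)), key=lambda i: cosine(text_emb, gt_feats[i]))
    target = gt_feats[best]

    # Step 2: query individual Gaussians with the retrieved ground-truth feature,
    # not with the raw text embedding.
    selected = [
        i for i, feat in enumerate(gaussian_feats)
        if cosine(feat, target) >= threshold
    ]
    return best, selected


# Toy 2-D features, invented for this example only.
text_emb = [1.0, 0.1]
gt_feats = [[1.0, 0.0], [0.0, 1.0]]
gaussian_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.05]]
best, selected = two_step_query(text_emb, gt_feats, gaussian_feats)
```

In this sketch the Gaussians are compared against the retrieved ground-truth feature rather than against the text embedding directly, which is the point of the second step: the Gaussians were distilled from that ground truth, so it is a plausible assumption that they lie closer to it than to the text embedding. How the paper actually scores and thresholds the matches is not stated in the abstract.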