🤖 AI Summary
This work addresses the loss of geometric information and the propagation of semantic errors in open-vocabulary 3D semantic segmentation methods that distill knowledge from 2D models. The authors propose GeoGuide, a framework that leverages pretrained 3D models to enforce hierarchical geometry-semantic consistency. GeoGuide comprises three modules: Uncertainty-based Superpoint Distillation, Instance-level Mask Reconstruction driven by geometric priors, and Inter-Instance Relation Consistency. Together they improve semantic alignment under explicit 3D geometric guidance. Experiments on ScanNet v2, Matterport3D, and nuScenes show that GeoGuide outperforms existing approaches.
📝 Abstract
Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories, including those beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module that fuses geometric and semantic features to estimate per-point uncertainty. It then adaptively weights 2D features within each superpoint, suppressing noise while preserving discriminative information and thereby enhancing local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.
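
The abstract gives no formulas, so the following is only a rough sketch of two of the ideas it describes, not the authors' implementation. It assumes that per-point uncertainty is already available, that points are weighted by `exp(-uncertainty)` inside each superpoint, and that relation consistency is measured as the mean squared difference between two cosine-similarity matrices. The function names `pool_superpoints`, `cosine_similarity_matrix`, and `relation_consistency_loss` are hypothetical.

```python
import math


def pool_superpoints(feats_2d, uncertainty, superpoint_ids):
    """Uncertainty-weighted average of per-point 2D features in each superpoint.

    feats_2d:       list of N feature vectors lifted from a 2D model (assumed given)
    uncertainty:    list of N non-negative per-point uncertainties (assumed given)
    superpoint_ids: list of N integer superpoint labels in [0, S)
    """
    num_superpoints = max(superpoint_ids) + 1
    dim = len(feats_2d[0])
    weighted_sum = [[0.0] * dim for _ in range(num_superpoints)]
    weight_total = [0.0] * num_superpoints
    for feat, u, sp in zip(feats_2d, uncertainty, superpoint_ids):
        w = math.exp(-u)  # assumed weighting: low uncertainty -> high weight
        weight_total[sp] += w
        for d in range(dim):
            weighted_sum[sp][d] += w * feat[d]
    # Normalize each superpoint by its total weight (guarding against zero).
    return [
        [value / max(total, 1e-12) for value in row]
        for row, total in zip(weighted_sum, weight_total)
    ]


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between the rows of feats."""
    unit = []
    for f in feats:
        norm = max(math.sqrt(sum(v * v for v in f)), 1e-12)
        unit.append([v / norm for v in f])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def relation_consistency_loss(geo_feats, sem_feats):
    """Mean squared difference between geometric and semantic similarity matrices.

    geo_feats and sem_feats hold one feature vector per instance, in the same order.
    """
    geo_sim = cosine_similarity_matrix(geo_feats)
    sem_sim = cosine_similarity_matrix(sem_feats)
    n = len(geo_sim)
    return sum(
        (geo_sim[i][j] - sem_sim[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)
```

Under these assumptions, a point with high uncertainty contributes almost nothing to its superpoint's pooled feature, and the loss is zero whenever the semantic features reproduce the pairwise similarity structure of the geometric features.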