🤖 AI Summary
This work addresses the semantic inconsistency in unsupervised 3D point cloud segmentation caused by the modality mismatch between 2D and 3D representations. To bridge this gap, it introduces 3D Gaussian Splatting as a unified intermediate representation, reconstructing sparse point clouds into a dense Gaussian space. Multi-view images are rendered from this space, and semantic masks are extracted from them using SAM. Contrastive learning then distills this 2D semantic knowledge into the 3D Gaussian primitives. Finally, a two-step registration followed by nearest-neighbor search propagates the learned semantics back to the original point cloud. This approach mitigates the domain gap between discrete point clouds and continuous images, resolves semantic ambiguity caused by projection overlap, and enforces cross-view consistency. It achieves state-of-the-art performance among unsupervised methods, surpassing prior best results by 0.9% and 2.8% mIoU on ScanNet-V2 and S3DIS, respectively.
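As a rough illustration of the contrastive distillation step, the sketch below pulls together features that fall inside the same SAM mask and pushes apart features from different masks. The summary does not state the exact loss, so this assumes a supervised-contrastive (InfoNCE-style) formulation; the function name `mask_contrastive_loss`, the `temperature` value, and the use of NumPy arrays in place of rendered Gaussian features are all hypothetical.

```python
import numpy as np

def mask_contrastive_loss(features, mask_ids, temperature=0.1):
    """Assumed InfoNCE-style loss: samples sharing a mask id are positives.

    features: (N, D) array of per-pixel features (stand-in for rendered
              Gaussian features); needs N >= 2 and non-zero rows.
    mask_ids: (N,) integer array of 2D mask ids for the same pixels.
    """
    # Cosine similarity between all pairs, scaled by the temperature.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature

    n = len(mask_ids)
    not_self = ~np.eye(n, dtype=bool)
    positives = (mask_ids[:, None] == mask_ids[None, :]) & not_self

    # Log-softmax over every other sample (self-similarity is excluded).
    sim = np.where(not_self, sim, -np.inf)
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )

    # Average log-probability of the positives for each anchor that has any.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())
```

Under this assumed formulation, the loss is small when features already cluster by mask and large when the grouping disagrees with the masks.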
📝 Abstract
Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, which compromise semantic consistency during 2D-to-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are then rendered from the Gaussian space, 2D semantic masks are extracted from them via SAM, and the semantics are distilled into the 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across views. Finally, the Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search over the labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, improving mIoU by 0.9% on ScanNet-V2 and by 2.8% on S3DIS.
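The final label-transfer step can be sketched as a brute-force nearest-neighbor lookup. This is a minimal illustration, not the authors' implementation: it assumes the Gaussians are represented by their centers, that registration has already put them in the same coordinate frame as the point cloud, and that each Gaussian carries one integer label. The function name `propagate_labels` and the `chunk` parameter are hypothetical.

```python
import numpy as np

def propagate_labels(gaussian_centers, gaussian_labels, points, chunk=4096):
    """Give each point the label of its nearest labeled Gaussian center.

    gaussian_centers: (G, 3) array, assumed already registered to `points`.
    gaussian_labels:  (G,) integer array of per-Gaussian semantic labels.
    points:           (P, 3) array of original point cloud coordinates.
    """
    labels = np.empty(len(points), dtype=gaussian_labels.dtype)
    # Process points in chunks so the (chunk, G) distance matrix stays small.
    for start in range(0, len(points), chunk):
        block = points[start:start + chunk]
        diff = block[:, None, :] - gaussian_centers[None, :, :]
        sq_dist = (diff ** 2).sum(axis=2)
        labels[start:start + chunk] = gaussian_labels[sq_dist.argmin(axis=1)]
    return labels
```

For large scenes, a spatial index such as a k-d tree would likely replace the brute-force distance matrix; the chunked version is used here only to keep the sketch dependency-light.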