🤖 AI Summary
To address slow initialization and poor multi-view consistency in multi-object segmentation for Gaussian Splatting, this paper proposes PointGauss, a point cloud-guided framework for real-time 3D instance segmentation. Its core contributions are: (1) a point cloud-based decoder for Gaussian primitives that generates 3D instance masks within one minute; and (2) a GPU-accelerated 2D mask rendering system that keeps segmentation consistent across views. In experiments, PointGauss improves multi-view mean Intersection-over-Union (mIoU) over previous state-of-the-art methods by 1.89 to 31.78%. The paper also introduces DesktopObjects-360, a large-scale dataset for 3D segmentation in radiance fields that offers complex multi-object scenes, globally consistent 2D annotations, over 27 thousand 2D masks, full 360° coverage, and 3D evaluation masks.
📝 Abstract
We introduce PointGauss, a novel point cloud-guided framework for real-time multi-object segmentation in Gaussian Splatting representations. Unlike existing methods, which suffer from prolonged initialization and limited multi-view consistency, our approach achieves efficient 3D segmentation by directly parsing Gaussian primitives through a pipeline driven by point cloud segmentation. The key innovations are twofold: (1) a point cloud-based Gaussian primitive decoder that generates 3D instance masks within one minute, and (2) a GPU-accelerated 2D mask rendering system that ensures multi-view consistency. Extensive experiments demonstrate significant improvements over previous state-of-the-art methods, with gains of 1.89 to 31.78% in multi-view mIoU, while maintaining superior computational efficiency. To address the limitations of current benchmarks (single-object focus, inconsistent 3D evaluation, small scale, and partial coverage), we present DesktopObjects-360, a novel comprehensive dataset for 3D segmentation in radiance fields, featuring: (1) complex multi-object scenes, (2) globally consistent 2D annotations, (3) large-scale training data (over 27 thousand 2D masks), (4) full 360° coverage, and (5) 3D evaluation masks.
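The headline metric above, multi-view mIoU, can be illustrated with a minimal sketch. The abstract does not specify the exact evaluation protocol, so the definition here is an assumption: IoU is computed per instance in each rendered view and then averaged over all (view, instance) pairs in which the instance appears in either the prediction or the ground truth. The function name `multi_view_miou` and the label encoding (0 for background, 1 to `num_instances` for objects) are hypothetical and chosen only for illustration.

```python
import numpy as np

def multi_view_miou(pred_masks, gt_masks, num_instances):
    """Average per-instance IoU over all views (assumed protocol, not the paper's).

    pred_masks, gt_masks: integer arrays of shape (V, H, W),
    where 0 is background and 1..num_instances are instance ids.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        for k in range(1, num_instances + 1):
            p, g = pred == k, gt == k
            union = np.logical_or(p, g).sum()
            if union == 0:
                # Instance is absent from both masks in this view: skip it.
                continue
            inter = np.logical_and(p, g).sum()
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Two tiny 2x2 "views" with two instances.
gt = np.array([[[1, 1], [2, 0]],
               [[1, 1], [2, 2]]])
pred = np.array([[[1, 0], [2, 0]],
                 [[1, 1], [2, 2]]])
score = multi_view_miou(pred, gt, num_instances=2)
```

Averaging over (view, instance) pairs weights every visible object equally regardless of its pixel area; a protocol that averages per scene or per class first would yield different numbers.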