🤖 AI Summary
Existing 3D Gaussian splatting segmentation methods neglect semantic guidance, leading to inconsistent 2D masks across views and blurry object boundaries. To address this, we propose Gaussian Instance Tracing (GIT), a framework that explicitly incorporates 2D instance segmentation consistency into 3D Gaussian optimization. GIT introduces a multi-view instance weight matrix to rectify mask inconsistency and an adaptive density control mechanism that splits or prunes ambiguous Gaussians during training to sharpen boundaries. It supports both online self-prompting and offline contrastive lifting paradigms. Evaluated across multiple scenes, GIT consistently improves 3D segmentation quality, enabling clean object extraction, hierarchical segmentation, and generation of editable 3D assets, while preserving geometric fidelity and rendering efficiency.
📝 Abstract
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries because they do not use semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
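The idea that each Gaussian should belong to a single object can be illustrated with a small sketch. It is not the authors' implementation: it assumes the instance weights have already been accumulated over all input views into one row per Gaussian and one column per instance, and the names `dominant_instance`, `find_ambiguous`, and the `min_share` threshold are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def dominant_instance(weights: List[float]) -> Tuple[int, float]:
    """Return (instance_id, share) for one Gaussian's accumulated weights.

    `share` is the fraction of the Gaussian's total weight that falls on
    its strongest instance. Returns (-1, 0.0) if the Gaussian has no weight.
    """
    total = sum(weights)
    if total <= 0.0:
        return -1, 0.0  # never contributed to any labelled pixel
    best = max(range(len(weights)), key=lambda k: weights[k])
    return best, weights[best] / total


def find_ambiguous(weight_matrix: List[List[float]],
                   min_share: float = 0.8) -> List[int]:
    """Indices of Gaussians whose weight is spread over several instances.

    `min_share` is an assumed threshold, not a value taken from the paper.
    """
    ambiguous = []
    for g, row in enumerate(weight_matrix):
        instance_id, share = dominant_instance(row)
        if instance_id >= 0 and share < min_share:
            ambiguous.append(g)
    return ambiguous


# Three Gaussians, two instances (toy numbers).
weight_matrix = [
    [9.0, 1.0],  # mostly instance 0: unambiguous
    [5.0, 5.0],  # split evenly: candidate for splitting or pruning
    [0.0, 0.0],  # no labelled contribution: ignored here
]
ambiguous_ids = find_ambiguous(weight_matrix)
```

In this toy setting only the second Gaussian is flagged. A density control step in the spirit of the abstract would then split or prune the flagged Gaussians; how that decision is actually made is not specified here.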