🤖 AI Summary
Addressing the open challenge of online zero-shot monocular 3D instance segmentation, where existing methods rely on posed RGB-D sequences and thus fail on pure monocular video streams, the authors propose MoonSeg3R, the first end-to-end online monocular 3D instance segmentation framework. The method builds on CUT3R, a reconstructive foundation model that supplies geometric priors from a single RGB stream, and introduces: (1) a self-supervised query refinement module with spatial-semantic distillation that lifts masks from 2D vision foundation models into discriminative 3D queries; (2) a 3D query index memory that retrieves contextual queries for temporal consistency; and (3) CUT3R state-distribution tokens used as cross-frame mask identity descriptors to strengthen feature fusion. Evaluated on ScanNet200 and SceneNN, MoonSeg3R is the first method to perform purely monocular online 3D segmentation, matching state-of-the-art RGB-D methods while requiring neither depth input nor camera calibration.
📝 Abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D vision foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation, and that it achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
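To make the role of the 3D query index memory concrete, here is a minimal sketch of how a per-instance query memory with cross-frame matching could work: descriptors are stored per tracked instance, new frame queries are associated by cosine similarity, and matched entries are refreshed with an exponential moving average. The class name, threshold, momentum, and matching rule are illustrative assumptions for exposition, not the paper's actual design.

```python
import numpy as np

class QueryIndexMemory:
    """Hypothetical sketch of a 3D query index memory.

    One L2-normalized descriptor is kept per instance; queries from a new
    frame are matched by cosine similarity, and matched descriptors are
    updated with an EMA. Unmatched queries open new instance entries.
    """

    def __init__(self, dim, sim_thresh=0.7, momentum=0.9):
        self.dim = dim
        self.sim_thresh = sim_thresh  # assumed association threshold
        self.momentum = momentum      # assumed EMA weight for updates
        self.descriptors = []         # one vector per tracked instance

    @staticmethod
    def _normalize(v):
        return v / (np.linalg.norm(v) + 1e-8)

    def match_and_update(self, frame_queries):
        """Return an instance id for each query in the current frame."""
        ids = []
        for q in frame_queries:
            q = self._normalize(np.asarray(q, dtype=np.float64))
            if self.descriptors:
                sims = np.array([d @ q for d in self.descriptors])
                best = int(np.argmax(sims))
                if sims[best] >= self.sim_thresh:
                    # EMA update keeps the matched descriptor current
                    d = self.momentum * self.descriptors[best] \
                        + (1.0 - self.momentum) * q
                    self.descriptors[best] = self._normalize(d)
                    ids.append(best)
                    continue
            # no sufficiently similar entry: register a new instance
            self.descriptors.append(q)
            ids.append(len(self.descriptors) - 1)
        return ids
```

In an online pipeline, `match_and_update` would be called once per incoming frame, so instance identities persist across the stream without any depth or pose input; the similarity threshold trades off identity switches against over-segmentation.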