MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing online zero-shot 3D instance segmentation methods rely on posed RGB-D sequences and therefore fail on purely monocular video streams. We propose MoonSeg3R, an online monocular 3D instance segmentation framework that takes a single RGB stream as input. The method combines: (1) a self-supervised query refinement module with spatial-semantic distillation, which turns masks from 2D vision foundation models into discriminative 3D queries; (2) a 3D query index memory that retrieves contextual queries for temporal consistency; and (3) a state-distribution token from CUT3R that serves as a mask identity descriptor to strengthen cross-frame fusion. The CUT3R reconstructive foundation model supplies the geometric priors. On ScanNet200 and SceneNN, MoonSeg3R is the first method to enable online monocular 3D segmentation and is competitive with state-of-the-art RGB-D-based systems, without requiring depth input or camera poses.

📝 Abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.

Problem

Research questions and friction points this paper is trying to address.

Online monocular 3D instance segmentation without depth or pose input, a setting where methods built on posed RGB-D sequences fail
Transforming 2D segmentation masks from visual foundation models into discriminative 3D queries
Reaching performance competitive with RGB-D systems from a single RGB stream
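
To make the idea of turning a 2D mask into a 3D query more concrete, here is a minimal sketch of masked average pooling. It is an illustration only, not the paper's query refinement module: the function name `mask_to_query`, the per-pixel 3D point map, and the per-pixel feature map are all assumptions introduced here, standing in for whatever the 2D foundation model and CUT3R actually output.

```python
from typing import List, Sequence, Tuple

Vec = List[float]


def mask_to_query(
    mask: Sequence[Sequence[bool]],
    points: Sequence[Sequence[Sequence[float]]],
    feats: Sequence[Sequence[Sequence[float]]],
) -> Tuple[Vec, Vec]:
    """Pool per-pixel 3D points and features over one 2D mask.

    Hypothetical helper: `points[r][c]` is an (x, y, z) estimate for pixel
    (r, c) and `feats[r][c]` is a feature vector for the same pixel.
    Returns (centroid, mean_feature); raises ValueError on an empty mask.
    """
    count = 0
    centroid = [0.0, 0.0, 0.0]
    feat_dim = len(feats[0][0])
    mean_feat = [0.0] * feat_dim
    for r, row in enumerate(mask):
        for c, inside in enumerate(row):
            if not inside:
                continue  # pixel is outside the mask
            count += 1
            for k in range(3):
                centroid[k] += points[r][c][k]
            for k in range(feat_dim):
                mean_feat[k] += feats[r][c][k]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [v / count for v in centroid], [v / count for v in mean_feat]


# Toy 2x2 frame: the mask covers the left column only.
mask = [[True, False], [True, False]]
points = [[[0.0, 0.0, 1.0], [9.0, 9.0, 9.0]],
          [[2.0, 0.0, 3.0], [9.0, 9.0, 9.0]]]
feats = [[[1.0, 0.0], [5.0, 5.0]],
         [[0.0, 1.0], [5.0, 5.0]]]
centroid, query_feat = mask_to_query(mask, points, feats)
```

The pooled centroid and feature together form a crude per-instance descriptor. A learned refinement step, such as the distillation described in the abstract, would replace this plain average.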

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages CUT3R, a Reconstructive Foundation Model (RFM), for geometric priors from a single RGB stream
Uses self-supervised query refinement with spatial-semantic distillation
Implements a 3D query index memory for temporal consistency
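
As a rough illustration of how a query memory can keep instance identities stable across frames, the sketch below stores one query vector per instance and matches incoming queries by cosine similarity. This is not the paper's 3D query index memory, which indexes queries in 3D. The class name `QueryMemory`, the similarity threshold, and the momentum update are assumptions chosen for this example.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


class QueryMemory:
    """Hypothetical per-instance query store with similarity-based matching."""

    def __init__(self, match_threshold: float = 0.8, momentum: float = 0.5):
        self.queries: Dict[int, List[float]] = {}
        self.match_threshold = match_threshold  # assumed value, not from the paper
        self.momentum = momentum                # weight kept on the stored query
        self._next_id = 0

    def retrieve(self, query: Sequence[float]) -> Tuple[Optional[int], float]:
        """Return (instance id, similarity) of the closest stored query."""
        best_id, best_sim = None, -1.0
        for inst_id, stored in self.queries.items():
            sim = cosine(query, stored)
            if sim > best_sim:
                best_id, best_sim = inst_id, sim
        return best_id, best_sim

    def update(self, query: Sequence[float]) -> int:
        """Merge the query into a matching instance or register a new one."""
        inst_id, sim = self.retrieve(query)
        if inst_id is not None and sim >= self.match_threshold:
            m = self.momentum
            stored = self.queries[inst_id]
            self.queries[inst_id] = [m * s + (1.0 - m) * q
                                     for s, q in zip(stored, query)]
            return inst_id
        new_id = self._next_id
        self._next_id += 1
        self.queries[new_id] = list(query)
        return new_id


memory = QueryMemory(match_threshold=0.8)
id_a = memory.update([1.0, 0.0])   # first object seen
id_b = memory.update([0.0, 1.0])   # dissimilar query, so a new instance
id_c = memory.update([0.9, 0.1])   # close to the first object, so it is merged
```

In this toy run the third query is assigned the same identity as the first, which is the kind of cross-frame consistency a query memory is meant to provide.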