🤖 AI Summary
Addressing the open challenge of online zero-shot monocular 3D instance segmentation, where existing methods rely on posed RGB-D sequences and thus fail on pure monocular video streams, the authors propose MoonSeg3R, the first end-to-end online monocular 3D instance segmentation framework. The method builds on CUT3R, a reconstructive foundation model that supplies geometric priors from a single RGB stream, and introduces: (1) a self-supervised query refinement module with spatial-semantic distillation that lifts masks from 2D vision foundation models into discriminative 3D queries; (2) a 3D query index memory that retrieves contextual queries for temporal consistency; and (3) CUT3R state-distribution tokens used as cross-frame mask identity descriptors to strengthen feature fusion. Evaluated on ScanNet200 and SceneNN, MoonSeg3R is the first method to perform purely monocular online 3D segmentation, matching state-of-the-art RGB-D methods while requiring neither depth input nor camera calibration.
📝 Abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D vision foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation, and that it achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
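To make the role of the 3D query index memory concrete, here is a minimal sketch of how a per-instance query memory with cross-frame matching could work: descriptors are stored per tracked instance, new frame queries are associated by cosine similarity, and matched entries are refreshed with an exponential moving average. The class name, threshold, momentum, and matching rule are illustrative assumptions for exposition, not the paper's actual design.

```python
import numpy as np

class QueryIndexMemory:
    """Hypothetical sketch of a 3D query index memory.

    One L2-normalized descriptor is kept per instance; queries from a new
    frame are matched by cosine similarity, and matched descriptors are
    updated with an EMA. Unmatched queries open new instance entries.
    """

    def __init__(self, dim, sim_thresh=0.7, momentum=0.9):
        self.dim = dim
        self.sim_thresh = sim_thresh  # assumed association threshold
        self.momentum = momentum      # assumed EMA weight for updates
        self.descriptors = []         # one vector per tracked instance

    @staticmethod
    def _normalize(v):
        return v / (np.linalg.norm(v) + 1e-8)

    def match_and_update(self, frame_queries):
        """Return an instance id for each query in the current frame."""
        ids = []
        for q in frame_queries:
            q = self._normalize(np.asarray(q, dtype=np.float64))
            if self.descriptors:
                sims = np.array([d @ q for d in self.descriptors])
                best = int(np.argmax(sims))
                if sims[best] >= self.sim_thresh:
                    # EMA update keeps the matched descriptor current
                    d = self.momentum * self.descriptors[best] \
                        + (1.0 - self.momentum) * q
                    self.descriptors[best] = self._normalize(d)
                    ids.append(best)
                    continue
            # no sufficiently similar entry: register a new instance
            self.descriptors.append(q)
            ids.append(len(self.descriptors) - 1)
        return ids
```

In an online pipeline, `match_and_update` would be called once per incoming frame, so instance identities persist across the stream without any depth or pose input; the similarity threshold trades off identity switches against over-segmentation.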