GMOS: Grounding Moving Object Segmentation in 3D Space and Time

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing moving object segmentation methods: they rely on 2D modalities that lack 3D geometric cues, and they treat motion as a sequence-level property while neglecting instantaneous motion states. To overcome these issues, the authors propose GMOS, described as the first framework to model multi-object motion segmentation in 3D spatiotemporal space using only RGB video input, enabling temporally fine-grained and 3D-aware segmentation. The key contributions are: GMOS-2K, described as the first large-scale real-world dataset supporting fine-grained temporal evaluation, along with the MOS-I benchmark protocol; an end-to-end 3D-aware network and its faster foreground-background variant GMOS-S; and an online inference mechanism for streaming deployment. Experiments show that GMOS achieves state-of-the-art performance on MOS, MOS-I, and unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods.

📝 Abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground-background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.
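The difference between a sequence-level motion attribute and an instantaneous motion state can be illustrated with a small Python sketch. This is not the GMOS implementation or the GMOS-2K annotation format, and it does not reproduce any MOS-I metric. The toy data, the `is_moving` table, and the `instantaneous_mask` helper are all hypothetical names introduced only to make the distinction concrete.

```python
# Hypothetical toy annotations: 3 objects over 5 frames.
# is_moving[k][t] is True when object k moves independently of the camera at frame t.
is_moving = [
    [True, True, True, True, True],       # object 0: moves throughout
    [False, False, True, True, False],    # object 1: moves only in frames 2 and 3
    [False, False, False, False, False],  # object 2: never moves
]

# Sequence-level view: one label per object for the whole video.
sequence_level = [any(row) for row in is_moving]

# Instantaneous view: the set of moving object ids at each frame.
num_frames = len(is_moving[0])
moving_ids_per_frame = [
    [k for k, row in enumerate(is_moving) if row[t]] for t in range(num_frames)
]


def instantaneous_mask(instance_map, frame_idx):
    """Keep only the pixels of objects that are moving at frame_idx.

    instance_map is a 2D list of ints: 0 is background, k + 1 is object k.
    (This labelling convention is an assumption made for the sketch.)
    """
    moving = {k + 1 for k, row in enumerate(is_moving) if row[frame_idx]}
    return [[label if label in moving else 0 for label in row] for row in instance_map]
```

Under the sequence-level view, object 1 counts as moving in every frame of the video. Under the instantaneous view, it appears in the target mask only for frames 2 and 3, which is the kind of temporal granularity the abstract says sequence-level treatments overlook.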

Problem

Research questions and friction points this paper is trying to address.

Moving Object Segmentation

3D grounding

temporally fine-grained motion

instantaneous motion state

RGB video

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware segmentation

instantaneous motion modeling

moving object segmentation