AI Summary
Existing VOS benchmarks (e.g., DAVIS, YouTube-VOS) predominantly feature salient, isolated objects and fail to reflect real-world scene complexity. To address this gap, we introduce MOSEv2, the first systematic VOS benchmark incorporating camouflaged objects, non-physical entities (e.g., shadows, reflections), multi-shot sequences, and scenarios requiring external knowledge. MOSEv2 comprises 5,024 videos, over 700K high-quality masks, 200 categories, and 10,074 annotated objects, built upon MOSEv1 with multi-source acquisition and meticulous manual annotation. Extensive experiments reveal severe performance degradation of state-of-the-art methods: SAM2's J&F drops by over 25 percentage points on MOSEv2 versus MOSEv1, exposing critical limitations in handling occlusion, small objects, and low-light conditions. MOSEv2 thus establishes a rigorous evaluation platform and technical catalyst for advancing video object segmentation toward real-world robustness and generalization.
Abstract
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on MOSEv1's strengths and addressing its limitations, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, and smaller objects, as well as new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% J&F on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that, despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
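The J&F score quoted above averages two per-mask quantities: region similarity J (the intersection-over-union of the predicted and ground-truth masks) and boundary accuracy F (an F-measure over the masks' contours). Below is a minimal illustrative sketch of the idea, not the official evaluation code: the real boundary measure tolerates small contour misalignments via dilation, whereas this simplified version (with hypothetical function names) matches boundary pixels exactly.

```python
import numpy as np

def j_measure(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def boundary(mask):
    """Approximate one-pixel boundary: mask pixels with a missing 4-neighbour."""
    m = mask.astype(bool)
    eroded = m.copy()
    eroded[1:] &= m[:-1]
    eroded[:-1] &= m[1:]
    eroded[:, 1:] &= m[:, :-1]
    eroded[:, :-1] &= m[:, 1:]
    return m & ~eroded

def f_measure(pred, gt):
    """Simplified boundary F: precision/recall of exact boundary overlap
    (the official metric allows a small distance tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    match = (bp & bg).sum()
    prec = match / bp.sum() if bp.sum() else 0.0
    rec = match / bg.sum() if bg.sum() else 0.0
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def j_and_f(pred, gt):
    """J&F: mean of region similarity and boundary accuracy."""
    return (j_measure(pred, gt) + f_measure(pred, gt)) / 2
```

In benchmark tables, this per-frame score is averaged over all frames and annotated objects, so a drop from 76.4% to 50.9% reflects degradation across the whole dataset rather than a few hard videos.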