🤖 AI Summary
Existing video object segmentation (VOS) methods primarily target single-shot videos and struggle with the shot discontinuities introduced by editing in multi-shot videos. This work presents a systematic study of multi-shot semi-supervised VOS (MVOS). It proposes Transition Mimicking Augmentation (TMA), a data augmentation strategy that enables cross-shot generalization using only single-shot training data, and the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions. To support evaluation and future research, the authors introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and frequent shot transitions. SAAS achieves state-of-the-art performance on both YouMVOS and Cut-VOS, improving accuracy and stability in cross-shot segmentation. The code and datasets are publicly released.
📝 Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video containing multiple shots. Existing VOS methods mainly target single-shot videos and struggle with shot discontinuities, limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization from single-shot data and alleviates the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study of MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
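The abstract does not detail how transition mimicking works; the core idea of such an augmentation is to splice two single-shot clips together so the model sees synthetic shot cuts during training. Below is a minimal, hypothetical sketch (function and parameter names are assumptions, not the paper's API) that simulates a hard cut between two clips represented as `(T, H, W, C)` arrays:

```python
import numpy as np

def mimic_hard_cut(clip_a: np.ndarray, clip_b: np.ndarray, cut_frame: int) -> np.ndarray:
    """Hypothetical transition-mimicking augmentation (a sketch, not the
    paper's implementation): build a synthetic multi-shot clip by taking
    frames [0, cut_frame) from clip_a and filling the remainder of the
    sequence with frames from clip_b, simulating a hard shot cut.

    Both clips are (T, H, W, C) arrays with matching spatial shape.
    """
    assert clip_a.shape[1:] == clip_b.shape[1:], "spatial shapes must match"
    total_len = clip_a.shape[0]
    cut = max(0, min(cut_frame, total_len))  # clamp the cut point
    # Frames before the cut come from shot A, frames after from shot B.
    return np.concatenate([clip_a[:cut], clip_b[: total_len - cut]], axis=0)

# Example: an 8-frame clip with a cut at frame 5.
augmented = mimic_hard_cut(
    np.zeros((8, 4, 4, 3)), np.ones((8, 4, 4, 3)), cut_frame=5
)
```

In practice, the same splice would be applied to the mask annotations of both clips, so the model learns to re-identify the target object across the synthetic boundary.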