Segment Anything Across Shots: A Method and Benchmark

📅 2025-11-17
🤖 AI Summary
Existing video object segmentation (VOS) methods primarily target single-shot videos and struggle with the shot discontinuities that editing introduces in multi-shot videos. This work studies multi-shot semi-supervised VOS (MVOS), where a target object indicated by an initial mask must be segmented across shots. It proposes TMA, a transition mimicking data augmentation strategy that enables cross-shot generalization using only single-shot training data, and the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions. To support evaluation, it introduces Cut-VOS, a new benchmark with dense mask annotations, diverse object categories, and high-frequency shot transitions. SAAS achieves state-of-the-art performance on both YouMVOS and Cut-VOS. Code and datasets are publicly released.

📝 Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, which limits their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization with single-shot data and thereby alleviates the severe scarcity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
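To make the notion of a shot discontinuity concrete, the sketch below flags hard cuts by comparing grayscale histograms of consecutive frames. This is a classical baseline used here purely for illustration; it is not the transition detection used by SAAS, and the function name `detect_cuts`, the bin count, and the threshold are all assumptions.

```python
import numpy as np


def detect_cuts(frames, threshold=0.5, bins=32):
    """Return indices t at which frame t appears to start a new shot.

    frames: array of shape (T, H, W, 3) with values in [0, 255].
    Illustrative baseline only (hypothetical helper, not the SAAS detector).
    """
    gray = frames.mean(axis=-1)  # (T, H, W)
    hists = np.stack(
        [np.histogram(g, bins=bins, range=(0, 255))[0] for g in gray]
    ).astype(np.float64)
    hists /= hists.sum(axis=1, keepdims=True)  # normalize each histogram
    # Half the L1 distance between normalized histograms lies in [0, 1].
    dist = 0.5 * np.abs(hists[1:] - hists[:-1]).sum(axis=1)
    return [int(t) + 1 for t in np.flatnonzero(dist > threshold)]
```

A detector of this kind only reacts to abrupt appearance changes; it says nothing about where the target object reappears after the cut, which is the harder part of the MVOS task described above.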
Problem

Research questions and friction points this paper is trying to address.

Segmenting a target object, indicated by an initial mask, across multiple video shots
Handling shot discontinuities that limit the real-world applicability of existing VOS methods
Overcoming the scarcity of annotated multi-shot video data through transition-mimicking augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transition mimicking augmentation for cross-shot generalization
SAAS model that detects and comprehends shot transitions
Cut-VOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions
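As a rough illustration of the idea behind transition mimicking augmentation, the sketch below turns a single-shot clip into a pseudo two-shot clip by applying a random crop-and-zoom (and optional horizontal flip) to every frame after a chosen cut point, with the same transform applied to the masks. The source does not describe which transformations TMA actually uses, so the function `mimic_cut` and every transform choice here are assumptions, not the authors' implementation.

```python
import numpy as np


def mimic_cut(frames, masks, cut_index, rng):
    """Simulate a hard cut inside a single-shot clip (hypothetical helper).

    frames: (T, H, W, 3) array, masks: (T, H, W) array.
    Frames and masks from cut_index onward receive one shared random
    crop-and-zoom plus an optional horizontal flip, imitating a camera change.
    """
    frames, masks = frames.copy(), masks.copy()
    _, h, w = masks.shape
    # Crop window covering at least half of each dimension.
    ch = int(rng.integers(max(1, h // 2), h + 1))
    cw = int(rng.integers(max(1, w // 2), w + 1))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    # Nearest-neighbour index maps that resize the crop back to (h, w).
    rows = top + (np.arange(h) * ch) // h
    cols = left + (np.arange(w) * cw) // w
    if rng.random() < 0.5:
        cols = cols[::-1]  # horizontal flip
    frames[cut_index:] = frames[cut_index:][:, rows][:, :, cols]
    masks[cut_index:] = masks[cut_index:][:, rows][:, :, cols]
    return frames, masks
```

Calling `mimic_cut(frames, masks, cut_index=4, rng=np.random.default_rng(0))` leaves the first four frames untouched and changes the viewpoint of the rest, so a model trained on the result sees an abrupt appearance change while the mask supervision stays aligned with the frames.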