🤖 AI Summary
Existing video object segmentation (VOS) methods primarily target single-shot videos and struggle with the shot discontinuities introduced by editing in multi-shot videos. This work presents a systematic study of multi-shot semi-supervised VOS (MVOS). It proposes Transition Mimicking Augmentation (TMA), a data augmentation strategy that enables cross-shot generalization using only single-shot training data, and the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions. To support evaluation and future research, the authors introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and frequent shot transitions. SAAS achieves state-of-the-art performance on both YouMVOS and Cut-VOS, improving accuracy and stability in cross-shot segmentation. The code and datasets are publicly released.
📝 Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video containing multiple shots. Existing VOS methods mainly target single-shot videos and struggle with shot discontinuities, limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization from single-shot data and alleviates the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study of MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
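The abstract does not detail how transition mimicking works; the core idea of such an augmentation is to splice two single-shot clips together so the model sees synthetic shot cuts during training. Below is a minimal, hypothetical sketch (function and parameter names are assumptions, not the paper's API) that simulates a hard cut between two clips represented as `(T, H, W, C)` arrays:

```python
import numpy as np

def mimic_hard_cut(clip_a: np.ndarray, clip_b: np.ndarray, cut_frame: int) -> np.ndarray:
    """Hypothetical transition-mimicking augmentation (a sketch, not the
    paper's implementation): build a synthetic multi-shot clip by taking
    frames [0, cut_frame) from clip_a and filling the remainder of the
    sequence with frames from clip_b, simulating a hard shot cut.

    Both clips are (T, H, W, C) arrays with matching spatial shape.
    """
    assert clip_a.shape[1:] == clip_b.shape[1:], "spatial shapes must match"
    total_len = clip_a.shape[0]
    cut = max(0, min(cut_frame, total_len))  # clamp the cut point
    # Frames before the cut come from shot A, frames after from shot B.
    return np.concatenate([clip_a[:cut], clip_b[: total_len - cut]], axis=0)

# Example: an 8-frame clip with a cut at frame 5.
augmented = mimic_hard_cut(
    np.zeros((8, 4, 4, 3)), np.ones((8, 4, 4, 3)), cut_frame=5
)
```

In practice, the same splice would be applied to the mask annotations of both clips, so the model learns to re-identify the target object across the synthetic boundary.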