🤖 AI Summary
To address the challenges of object segmentation in videos exhibiting complex motion and long temporal durations, this paper proposes a pseudo-label-driven adaptive multi-model collaboration framework. Our method establishes a dual-model baseline comprising a fine-tuned SAM2 and an unsupervised video model (TMO), which jointly generate high-quality pseudo-labels during inference. An adaptive model selection mechanism then dynamically assigns the optimal model to each video segment based on these pseudo-labels. Crucially, this is the first approach to enable online, annotation-free model switching, substantially improving segmentation robustness and cross-scenario generalization. Evaluated on the 2025 PVUW MOSE challenge test set, our method achieves a new state-of-the-art J&F score of 87.26%, securing first place in the competition.
📝 Abstract
Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team. By fine-tuning SAM2 and the unsupervised model TMO on the MOSE dataset, STSeg demonstrates clear advantages in handling complex object motions and long video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline selects the most appropriate model for processing each video. With the fine-tuned models and this pipeline, STSeg achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing 1st place and advancing video object segmentation in complex scenarios.
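The adaptive, annotation-free model selection described above can be illustrated with a minimal sketch. The assumption here is that both models segment each video, a consensus pseudo-label is formed from their agreement, and the model whose masks align best with that pseudo-label is chosen for the video. All function names and the IoU-based consensus rule are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of pseudo-label-guided per-video model selection.
# Assumption: the pseudo-label is the pixelwise agreement (logical AND)
# of the two models' binary masks; the model agreeing more closely with
# it (mean IoU over frames) is selected. Not the paper's exact method.
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 1.0


def select_model(masks_sam2, masks_tmo):
    """Return which model to trust for this video, based on agreement
    with consensus pseudo-labels built from both models' predictions."""
    pseudo = [np.logical_and(s, t) for s, t in zip(masks_sam2, masks_tmo)]
    score_sam2 = np.mean([iou(s, p) for s, p in zip(masks_sam2, pseudo)])
    score_tmo = np.mean([iou(t, p) for t, p in zip(masks_tmo, pseudo)])
    return "SAM2" if score_sam2 >= score_tmo else "TMO"
```

In this toy rule, a model that over-segments relative to the consensus scores lower, so the tighter predictor wins that video; the actual pipeline presumably uses a more refined quality measure on the pseudo-labels.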