🤖 AI Summary
To address low segmentation accuracy and severe inter-frame object fragmentation in video object segmentation (VOS) under complex real-world scenarios, this work proposes the FVOS fine-tuning framework. First, a pre-trained model is fine-tuned on the target dataset with a lightweight, domain-adaptive training schedule. Second, a morphology-based post-processing module, built on morphological opening and closing operations, explicitly repairs intra-frame target discontinuities. Third, segmentation results produced at multiple scales are fused by weighted voting to improve temporal consistency and robustness. Evaluated on the PVUW 2025 MOSE challenge, the method achieves J&F scores of 76.81% on the validation set and 83.92% on the test set, ranking third overall. The framework targets the segmentation degradation caused by dynamic backgrounds, occlusions, and small objects.
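The morphological step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 3x3 square structuring element, the opening-then-closing order, the border handling (out-of-image pixels are ignored), and the function names are all assumptions made for illustration. In practice a library routine such as OpenCV's `cv2.morphologyEx` or SciPy's `scipy.ndimage.binary_opening` / `binary_closing` would normally be used instead.

```python
def _apply(mask, k, combine):
    """Slide a k x k square window over a binary mask (list of rows).

    Pixels outside the image are ignored (an assumption; libraries
    offer several border modes).
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = combine(window)
    return out


def dilate(mask, k=3):
    # A pixel becomes foreground if any pixel in its window is foreground.
    return _apply(mask, k, max)


def erode(mask, k=3):
    # A pixel stays foreground only if its whole window is foreground.
    return _apply(mask, k, min)


def opening(mask, k=3):
    # Erosion followed by dilation: removes specks smaller than the window.
    return dilate(erode(mask, k), k)


def closing(mask, k=3):
    # Dilation followed by erosion: fills gaps narrower than the window.
    return erode(dilate(mask, k), k)


def postprocess(mask, k=3):
    # Hypothetical ordering: clean noise first, then bridge small gaps.
    return closing(opening(mask, k), k)
```

With this sketch, an isolated foreground pixel is removed by the opening, and a one-pixel-wide gap splitting an object is filled by the closing. Larger values of `k` bridge wider gaps but also merge objects that are genuinely separate, so the window size would need tuning per dataset.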
📝 Abstract
Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), which adapts existing methods to a specific dataset through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method to multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% in the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
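The voting-based fusion of multi-scale results can be sketched as a per-pixel vote over binary masks. This is an illustrative sketch under stated assumptions, not the paper's procedure: it assumes each scale's prediction has already been resized to a common resolution, treats a single object (binary masks), and uses a hypothetical `vote_fusion` function with optional per-scale weights and a majority threshold of 0.5.

```python
def vote_fusion(masks, weights=None, threshold=0.5):
    """Fuse same-sized binary masks (lists of rows) by per-pixel voting.

    masks     : one binary mask per scale, already resized to a common
                H x W (assumption).
    weights   : optional per-mask vote weights; equal weights by default.
    threshold : fraction of the total weight a pixel must exceed to be
                labeled foreground (0.5 gives a majority vote).
    """
    if not masks:
        raise ValueError("at least one mask is required")
    if weights is None:
        weights = [1.0] * len(masks)
    if len(weights) != len(masks):
        raise ValueError("one weight per mask is required")

    total = sum(weights)
    h, w = len(masks[0]), len(masks[0][0])
    fused = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Weighted count of the masks that mark this pixel as foreground.
            score = sum(wt * m[y][x] for wt, m in zip(weights, masks))
            fused[y][x] = 1 if score > threshold * total else 0
    return fused
```

With equal weights this reduces to a plain majority vote, so a pixel is kept only when more than half of the scales agree. Giving one scale a larger weight lets it override the others, which is one way a stronger model or a more reliable resolution could be favored.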