🤖 AI Summary
Semi-supervised video object segmentation (SVOS) suffers from boundary ambiguity and feature confusion under occlusion, object interactions, and high inter-object feature similarity. To address these challenges, this paper proposes OASIS, a boundary refinement framework built on intrinsic structure optimization. OASIS fuses Canny edge priors with memory-augmented features to construct a lightweight object-level structural graph, and incorporates evidential learning for uncertainty modeling, thereby enhancing boundary representation and enabling robust segmentation of occluded regions. Balancing accuracy and efficiency, OASIS achieves state-of-the-art results on the DAVIS-17 and YouTube-VOS 2019 validation sets, attaining an F-score of 91.6 and a G-score of 86.6, respectively, while maintaining real-time inference at 48 FPS.
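The fusion of edge priors with stored features can be illustrated with a minimal NumPy sketch. This is purely a toy illustration, not the paper's implementation: the actual module operates on memory-augmented deep features and a real Canny filter, whereas here a Sobel gradient-magnitude proxy stands in for the edge prior, and `refine_features` is a hypothetical helper that simply amplifies feature channels at boundary locations.

```python
import numpy as np

def edge_prior(gray):
    """Rough binary edge map via Sobel magnitude (stand-in for a Canny prior)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(gray, 1, mode="edge")
    H, W = gray.shape
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    mag = np.hypot(gx, gy)
    # Threshold well above the mean to keep only strong responses
    return (mag > mag.mean() + mag.std()).astype(float)

def refine_features(feats, edges, gain=0.5):
    """Highlight boundary features: amplify C x H x W features where edges fire."""
    return feats * (1.0 + gain * edges[None, :, :])
```

On a toy image with a vertical step, `edge_prior` fires only along the step, and `refine_features` boosts every feature channel there while leaving the interior untouched, mirroring the idea of emphasizing boundary representations in the structure map.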
📝 Abstract
Given an object mask, Semi-supervised Video Object Segmentation (SVOS) aims to track and segment that object across video frames, a fundamental task in computer vision. Although recent memory-based methods show promise, they often struggle with scenes involving occlusion, particularly under object interactions and high feature similarity. To address these issues while meeting the real-time processing requirements of downstream applications, in this paper we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy: by fusing rough edge priors captured by a Canny filter with stored object features, the module generates an object-level structure map and refines the representations by highlighting boundary features. Evidential learning for uncertainty estimation is further introduced to address challenges in occluded regions. Despite its efficient design, extensive experiments on challenging benchmarks demonstrate that OASIS delivers superior performance and competitive inference speed compared with other state-of-the-art methods, achieving an F score of 91.6 (vs. 89.7) on the DAVIS-17 validation set and a G score of 86.6 (vs. 86.2) on the YouTube-VOS 2019 validation set, while maintaining a competitive speed of 48 FPS on DAVIS.
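The evidential learning component can likewise be sketched in a few lines. In standard evidential deep learning, per-pixel logits are mapped to Dirichlet evidence, and uncertainty is the ratio of the number of classes to the total Dirichlet strength; pixels in ambiguous (e.g., occluded) regions yield low evidence and hence high uncertainty. The helper below is an illustrative sketch of that formulation, not the paper's exact loss or network head.

```python
import numpy as np

def evidential_uncertainty(logits):
    """Dirichlet-based uncertainty per pixel for K x H x W class logits.

    evidence e = relu(logits), alpha = e + 1, strength S = sum_k alpha_k,
    uncertainty u = K / S (u -> 1 when no evidence, u -> 0 with strong evidence).
    """
    e = np.maximum(logits, 0.0)        # non-negative evidence
    alpha = e + 1.0                    # Dirichlet parameters
    K = logits.shape[0]                # number of classes
    return K / alpha.sum(axis=0)       # per-pixel uncertainty in (0, 1]
```

With zero logits the map is maximally uncertain (u = 1 everywhere); as evidence for any class grows, u shrinks, which is the signal the method can use to treat occluded boundary pixels more cautiously.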