🤖 AI Summary
Robust segmentation of multiple similar objects—especially those with complex articulated parts—in long videos remains challenging due to occlusion, time-varying appearance and environmental conditions, and cluttered backgrounds, leading to tracking drift and part-level ambiguity. To address this, we propose a spatial-semantic joint modeling framework. Our key contributions are: (1) a novel spatial-semantic module that jointly captures global semantics and local spatial dependencies; (2) a masked cross-attention mechanism to generate discriminative object queries; and (3) a mask-guided query propagation strategy to enhance long-term consistency and fine-grained part discrimination. The framework is end-to-end differentiable and trained on large-scale video segmentation benchmarks. It achieves state-of-the-art performance on DAVIS 2017 (87.8% J&F), YouTube-VOS 2019 (88.1% J&F), MOSE val (74.0% J&F), and LVOS test (73.0% J&F). Code and models are publicly available.
📝 Abstract
Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part for associating global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation to ensure effective long-term query propagation. Extensive experimental results show that the proposed method achieves state-of-the-art performance on benchmark datasets, including the DAVIS 2017 test (87.8%), YouTube-VOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), and demonstrate the effectiveness and generalization capacity of our model. The source code and trained models are released at https://github.com/yahooo-m/S3.
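To make the masked cross-attention idea concrete, below is a minimal numpy sketch of the general mechanism: attention logits outside a predicted foreground mask are suppressed so each object query attends only to its masked region. This is an illustrative sketch under assumed shapes and naming, not the paper's implementation; the function name `masked_cross_attention` and all tensor layouts are assumptions for exposition.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask):
    """Attention restricted to a foreground mask.

    queries: (Q, d) object queries
    keys:    (N, d) flattened pixel/feature keys
    values:  (N, d) flattened pixel/feature values
    mask:    (Q, N) boolean; True where a query may attend
    Returns: (Q, d) mask-attended query features.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)          # (Q, N) similarity
    logits = np.where(mask, logits, -1e9)           # suppress background positions
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values                         # aggregate masked-region values
```

Restricting the softmax support to the mask is what keeps background clutter and distractor objects from leaking into the propagated queries over long sequences.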