🤖 AI Summary
Robust segmentation of multiple similar objects—especially those with complex articulated parts—in long videos remains challenging due to occlusion, time-varying appearance and environmental conditions, and cluttered backgrounds, leading to tracking drift and part-level ambiguity. To address this, we propose a spatial-semantic joint modeling framework. Our key contributions are: (1) a novel spatial-semantic module that jointly captures global semantics and local spatial dependencies; (2) a masked cross-attention mechanism to generate discriminative object queries; and (3) a mask-guided query propagation strategy to enhance long-term consistency and fine-grained part discrimination. The framework is end-to-end differentiable and trained on large-scale video segmentation benchmarks. It achieves state-of-the-art performance on DAVIS 2017 (87.8% J&F), YouTube-VOS 2019 (88.1% J&F), MOSE val (74.0% J&F), and LVOS test (73.0% J&F). Code and models are publicly available.
📝 Abstract
Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part for associating global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation to ensure effective long-term query propagation. Extensive experimental results show that the proposed method achieves state-of-the-art performance on benchmark datasets, including the DAVIS 2017 test (87.8%), YouTube-VOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), and demonstrate the effectiveness and generalization capacity of our model. The source code and trained models are released at https://github.com/yahooo-m/S3.
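To make the masked cross-attention idea concrete, below is a minimal numpy sketch of the general mechanism: attention logits outside a predicted foreground mask are suppressed so each object query attends only to its masked region. This is an illustrative sketch under assumed shapes and naming, not the paper's implementation; the function name `masked_cross_attention` and all tensor layouts are assumptions for exposition.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask):
    """Attention restricted to a foreground mask.

    queries: (Q, d) object queries
    keys:    (N, d) flattened pixel/feature keys
    values:  (N, d) flattened pixel/feature values
    mask:    (Q, N) boolean; True where a query may attend
    Returns: (Q, d) mask-attended query features.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)          # (Q, N) similarity
    logits = np.where(mask, logits, -1e9)           # suppress background positions
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values                         # aggregate masked-region values
```

Restricting the softmax support to the mask is what keeps background clutter and distractor objects from leaking into the propagated queries over long sequences.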