DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation

📅 2025-07-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Unsupervised video object segmentation (VOS) suffers from the scarcity of ground-truth optical flow annotations and the limited performance of conventional two-stream methods. Method: This paper proposes DepthFlow, the first approach to exploit the strong structural correlation between depth and optical flow for salient objects. DepthFlow estimates a per-frame depth map from RGB input and synthesizes high-fidelity, structure-preserving optical flow via a geometrically informed flow-field transformation, thereby extending image-mask pairs into image-flow-mask triplets. The method trains a simple end-to-end encoder-decoder architecture without requiring real optical flow supervision. Contribution/Results: DepthFlow achieves state-of-the-art performance across all major unsupervised VOS benchmarks, significantly outperforming existing two-stream approaches, and extensive experiments validate the effectiveness, generalizability, and practicality of depth-guided optical flow synthesis for unsupervised VOS.

๐Ÿ“ Abstract
Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in unsupervised video object segmentation
Synthesizing optical flow from single images using depth
Improving VOS performance with structural depth-flow correlations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes optical flow from single images
Converts depth maps into synthetic flow fields
Expands training data with image-flow-mask pairs
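The exact flow-field transformation is not specified in this summary, but the depth-to-flow conversion the bullets describe can be sketched under the assumption of a simple inverse-depth parallax model, where nearer pixels move more under a small virtual camera translation. The function name `depth_to_flow` and the parameters `tx`, `ty` below are hypothetical, for illustration only:

```python
import numpy as np

def depth_to_flow(depth, tx=8.0, ty=2.0, eps=1e-6):
    """Convert a depth map (H, W) into a synthetic flow field (H, W, 2).

    Assumes an inverse-depth parallax model: apparent motion under a small
    camera translation is inversely proportional to depth, so the synthetic
    flow inherits the object boundaries present in the depth map.
    """
    inv_depth = 1.0 / (depth + eps)
    # Normalize to [0, 1] so flow magnitude is controlled by (tx, ty).
    inv_depth = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + eps)
    return np.stack([tx * inv_depth, ty * inv_depth], axis=-1)

# Stand-in for a monocular depth estimate of one training image
depth = np.random.rand(240, 320).astype(np.float32) + 0.1
flow = depth_to_flow(depth)  # synthetic flow field for an image-flow-mask triplet
print(flow.shape)  # (240, 320, 2)
```

The geometric accuracy of such a field is crude, but per the abstract that is acceptable: the VOS model relies on the structural cues in the flow map (object boundaries, relative motion contrast), which the depth map already provides.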