Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance bottlenecks in unsupervised video object segmentation (UVOS) caused by motion-appearance feature imbalance and unstable optical flow quality, this paper proposes a Trunk-Collateral architecture, comprising a shared trunk backbone and a collateral motion branch, along with an Intrinsic Saliency guided Refinement Module (ISRM). The architecture separates shared and distinctive motion-appearance representations via the shared trunk, while ISRM introduces pixel-level intrinsic saliency modeling that requires no auxiliary input, enabling adaptive motion-appearance fusion and multi-scale feature refinement. Evaluated on standard benchmarks, the method achieves 89.2% J&F on DAVIS-16, 76.0% J on YouTube-Objects, and 86.4% J on FBMS. Moreover, it consistently outperforms state-of-the-art methods across four video salient object detection (VSOD) benchmarks, demonstrating superior robustness and generalization capability.

📝 Abstract
Recent unsupervised video object segmentation (UVOS) methods predominantly adopt the motion-appearance paradigm. Mainstream motion-appearance approaches use either a two-encoder structure to separately encode motion and appearance features, or a single-encoder structure for joint encoding. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the extracted suboptimal features degrade the models' overall performance. Moreover, the quality of optical flow varies across scenarios, making it insufficient to rely solely on optical flow to achieve high-quality segmentation results. To address these challenges, we propose the Intrinsic Saliency guided Trunk-Collateral Network (ISTC-Net), which better balances the motion-appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, since optical flow maps are derived from RGB images, the two modalities share both commonalities and differences; we therefore propose a novel Trunk-Collateral structure. The shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model's intrinsic saliency information to refine high-level features and provide pixel-level guidance for motion-appearance fusion, thereby enhancing performance without additional input. Experimental results show that ISTC-Net achieved state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76.0% J on YouTube-Objects, 86.4% J on FBMS) and notable gains on four standard video salient object detection (VSOD) benchmarks, demonstrating its effectiveness and superiority over previous methods.
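The abstract's trunk-collateral idea and saliency-guided fusion can be illustrated with a toy NumPy sketch. This is not the paper's actual implementation: the projection matrices, the single-channel saliency gate, and all variable names here are hypothetical placeholders standing in for the real backbone, collateral branch, and ISRM. It only shows the general shape of the computation: both streams pass through shared trunk weights, the motion stream gets a collateral residual, and a pixel-level gate derived from the features themselves (no auxiliary input) balances the two streams at fusion.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-pixel feature maps of shape (H, W, C):
# one for the RGB appearance stream, one for the optical-flow motion stream.
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
appearance = rng.standard_normal((H, W, C))
motion = rng.standard_normal((H, W, C))

# Shared "trunk": the SAME projection is applied to both streams,
# capturing the motion-appearance commonality.
W_trunk = rng.standard_normal((C, C)) * 0.1
app_feat = appearance @ W_trunk
mot_feat = motion @ W_trunk

# "Collateral" branch: a motion-specific residual transform that lets
# the motion stream learn what is unique to optical flow.
W_collateral = rng.standard_normal((C, C)) * 0.1
mot_feat = mot_feat + motion @ W_collateral

# Intrinsic saliency gate: a pixel-level weight in (0, 1) computed from
# the model's own features, with no auxiliary input.
w_sal = rng.standard_normal((C, 1)) * 0.1
gate = sigmoid((app_feat + mot_feat) @ w_sal)  # shape (H, W, 1)

# Saliency-guided fusion: per pixel, the gate decides how much to trust
# appearance vs. motion, so unreliable optical flow can be down-weighted.
fused = gate * app_feat + (1.0 - gate) * mot_feat
print(fused.shape)
```

In the actual ISTC-Net these roles are played by a deep backbone and learned modules rather than single matrix products, but the per-pixel convex combination at the end is the intuition behind "pixel-level guidance for motion-appearance fusion".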
Problem

Research questions and friction points this paper is trying to address.

Balancing motion-appearance relationship in UVOS
Improving segmentation without relying solely on optical flow
Enhancing performance using intrinsic saliency information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trunk-Collateral structure balances motion-appearance features
Intrinsic Saliency Refinement Module enhances segmentation
No additional input needed for performance boost
Xiangyu Zheng
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
Wanyun Li
Fudan University
Deep Learning, Computer Vision
Songcheng He
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
Xiaoqiang Li
School of Computer Engineering and Science, Shanghai University, Shanghai 200433, China
Wei Zhang
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China