TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection

📅 2026-04-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing co-salient object detection (CoSOD) methods, which typically rely on training and suffer from restricted generalization. To overcome these issues, the authors propose the first training-free CoSOD framework, which synergistically leverages two vision foundation models, SAM and DINO. The approach employs SAM to generate high-quality candidate masks and uses DINO's attention maps to build an intra-image saliency filter. A cross-image prototype selection mechanism then identifies the co-salient objects. Extensive experiments show that the proposed method significantly outperforms current state-of-the-art techniques across multiple benchmarks, achieving a 13.7% improvement over the latest training-free approach while exhibiting superior generalization and detection accuracy.
πŸ“ Abstract
Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they remain constrained by closed-set datasets and exhibit limited generalization. Meanwhile, few studies have explored the potential of Vision Foundation Models (VFMs), which demonstrate strong generalization and robust saliency understanding, for CoSOD. In this paper, we investigate and leverage VFMs for CoSOD and propose a novel training-free method, TF-SSD, built on the synergy between SAM and DINO. Specifically, we first use SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. We then introduce a quality mask generator to filter out redundant masks, yielding a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across the image group, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes and selects the mask with the highest score in each image. These selected masks serve as the final CoSOD predictions. Extensive experiments show that TF-SSD outperforms existing methods (e.g., 13.7% gains over the most recent training-free method). Code is available at https://github.com/hzz-yy/TF-SSD.
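The inter-image prototype selector described in the abstract can be illustrated with a minimal sketch: each candidate mask is reduced to a prototype vector (e.g., the mean of DINO patch features inside the mask), and in each image the candidate whose prototype best matches candidates in the other images is selected. The function names, scoring rule (sum of best-match cosine similarities), and the use of mean-pooled features are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mask_prototype(features, mask):
    # features: (H, W, D) patch embeddings (e.g., from DINO); mask: (H, W) bool.
    # Mean-pooling inside the mask is an assumed prototype construction.
    return features[mask].mean(axis=0)

def select_co_salient(prototypes_per_image):
    # prototypes_per_image: list over images, each an (M_i, D) array of
    # candidate-mask prototypes. Returns the index of the chosen mask per image.
    protos = [l2_normalize(np.asarray(p, dtype=float)) for p in prototypes_per_image]
    selected = []
    for i, p_i in enumerate(protos):
        scores = np.zeros(len(p_i))
        for j, p_j in enumerate(protos):
            if i == j:
                continue
            sim = p_i @ p_j.T           # (M_i, M_j) cosine similarities
            scores += sim.max(axis=1)   # best cross-image match in image j
        selected.append(int(np.argmax(scores)))
    return selected
```

With three images whose first candidate prototypes point in nearly the same direction, `select_co_salient` picks candidate 0 in every image, mirroring the idea that the co-salient object is the one that matches consistently across the group.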
Problem

Research questions and friction points this paper is trying to address.

Co-salient Object Detection
Vision Foundation Models
Generalization
Training-free
Saliency Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free
Co-salient Object Detection
Vision Foundation Models
SAM-DINO Synergy
Mask Filtering
Zhijin He
XJTLU
Shuo Jin
XJTLU, University of Liverpool
Siyue Yu
XJTLU
Shuwei Wu
XJTLU
Bingfeng Zhang
China University of Petroleum (East China)
Li Yu
Associate Professor, Nanjing University of Information Science & Technology
video compression, video streaming, multimedia
Jimin Xiao
Professor in Intelligent Science, Xi'an Jiaotong-Liverpool University
computer vision, machine learning