🤖 AI Summary
This paper addresses the unreliability of pseudo-masks in unsupervised salient object detection (SOD), which stems from the absence of pixel-level annotations. To this end, the authors propose AutoSOD, a fully end-to-end framework built on a novel split-fuse-transport design (POTNet). Its core components are: (i) an entropy-guided dual-clustering head that routes high-entropy pixels to spectral clustering and low-entropy pixels to k-means; (ii) optimal transport (OT) to align the two resulting prototype sets; and (iii) a MaskFormer-style encoder-decoder supervised by the resulting high-quality, part-aware, boundary-sharp pseudo-masks, which are produced in a single forward pass without hand-crafted priors or offline voting heuristics. Evaluated on five standard benchmarks, AutoSOD achieves gains of up to 26% and 36% in F-measure over state-of-the-art unsupervised and weakly supervised methods, respectively, closely approaching fully supervised performance while improving both training efficiency and segmentation accuracy.
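The entropy-guided split in (i) can be sketched in a few lines: each pixel's soft cluster-assignment distribution is scored by its Shannon entropy, and pixels above a threshold are treated as boundary-like. This is a minimal numpy sketch; the median threshold is an assumption for illustration, not necessarily the paper's rule.

```python
import numpy as np

def entropy_split(probs, tau=None):
    """Split pixels by the entropy of their soft cluster assignments.

    probs: (N, K) soft assignment probabilities per pixel.
    Returns a boolean mask: True = high-entropy (boundary-like) pixel,
    to be handled by spectral clustering; False = low-entropy interior
    pixel, to be handled by k-means.
    """
    h = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # Shannon entropy per pixel
    if tau is None:
        tau = np.median(h)  # assumed threshold; the actual rule may differ
    return h > tau

# Toy input: confident interior pixels vs. ambiguous boundary pixels.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident  -> low entropy
    [0.34, 0.33, 0.33],   # ambiguous  -> high entropy
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
])
mask = entropy_split(probs)
print(mask)  # True where the pixel should go to the spectral branch
```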
📝 Abstract
Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
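The transport step of the split-fuse-transport design, aligning the two prototype sets by OT, can be sketched with entropic Sinkhorn iterations. This is a minimal numpy sketch: the pairwise Euclidean cost and uniform marginals are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan for uniform marginals (Sinkhorn iterations)."""
    K = np.exp(-cost / eps)                       # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    v = np.ones_like(b)
    for _ in range(n_iters):                      # alternate scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]            # transport plan

# Toy prototypes: Q is a permuted, slightly noisy copy of P,
# standing in for prototype sets from the two clustering branches.
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 8))                                # k-means branch
Q = P[[2, 0, 3, 1]] + 0.01 * rng.normal(size=(4, 8))       # spectral branch

cost = np.linalg.norm(P[:, None] - Q[None, :], axis=-1)    # Euclidean cost
plan = sinkhorn(cost)
match = plan.argmax(axis=1)   # hard matching: P[i] -> its noisy copy in Q
print(match)
```

Because the plan concentrates its mass on low-cost pairs, the row-wise argmax recovers the correspondence between the two prototype sets even though their ordering differs.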