🤖 AI Summary
Salient object detection (SOD) faces two challenges: high pixel-level annotation cost and poor cross-task generalization. To address these, we propose S3OD: (1) a large-scale synthetic dataset of 139K high-resolution images generated via multimodal diffusion models, coupled with an unsupervised, high-accuracy pseudo-labeling method that jointly leverages DINO-v3 self-supervised features and intermediate diffusion representations, the first approach of its kind; (2) an ambiguity-aware multi-mask decoder that explicitly models multiple plausible interpretations of saliency; and (3) a performance-feedback-driven iterative data-synthesis mechanism that dynamically prioritizes hard categories. Trained solely on synthetic data, S3OD reduces cross-dataset prediction error by 20–50%; after fine-tuning, it achieves state-of-the-art results on the DIS and HR-SOD benchmarks.
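The feedback loop in (3) amounts to re-weighting the next round of synthesis toward categories where the current model performs worst. A minimal sketch of one plausible scheme (error-proportional sampling; the function names and weighting rule are assumptions, as the paper's exact mechanism is not given here):

```python
import random

def category_sampling_weights(per_category_error):
    """Normalize per-category validation error into sampling probabilities,
    so harder categories are generated more often in the next round.
    (Hypothetical: S3OD's actual weighting rule may differ.)"""
    total = sum(per_category_error.values())
    return {cat: err / total for cat, err in per_category_error.items()}

def sample_next_batch(per_category_error, n, rng=None):
    """Draw n category labels for the next synthesis round,
    proportionally to each category's current error."""
    rng = rng or random.Random(0)
    cats = list(per_category_error)
    weights = [per_category_error[c] for c in cats]
    return rng.choices(cats, weights=weights, k=n)
```

With errors `{"glass": 0.4, "wire": 0.4, "animal": 0.2}`, transparent and thin structures would dominate the next generation batch, mirroring the "prioritize challenging categories" behavior described above.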
📝 Abstract
Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20–50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
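A common way to train a decoder that outputs multiple valid interpretations is a winner-takes-all loss: score every predicted mask against the single ground-truth mask and backpropagate only through the best hypothesis, so each head specializes in one plausible reading of the scene. A minimal NumPy sketch under that assumption (the abstract does not specify S3OD's actual training loss):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over the mask."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def winner_takes_all_loss(mask_preds, gt_mask):
    """Score each of the K predicted masks against the ground truth and
    return the loss and index of the best hypothesis; in training, only
    that hypothesis would receive gradients."""
    losses = [bce(m, gt_mask) for m in mask_preds]
    best = int(np.argmin(losses))
    return losses[best], best
```

Because the other K−1 heads incur no penalty when one head fits the annotation, the decoder is free to keep alternative saliency interpretations alive rather than averaging them into a single blurry mask.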