Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation

📅 2025-03-10
🤖 AI Summary
This work addresses semantic segmentation with image-level supervision. It extends binary image-level class tags to approximate relative object-size distributions and uses them, instead of pixel-level masks, as the supervision signal. The method runs on off-the-shelf segmentation architectures and applies a zero-avoiding KL-divergence loss that aligns the average prediction over an image with the approximate size target, without architectural modifications or multi-stage training. The authors report segmentation accuracy comparable to full pixel-level supervision. The approach is validated on PASCAL VOC with new human annotations of approximate object sizes, and on COCO and medical data with synthetically corrupted size targets. Standard networks are robust to errors in the size targets, and for some classes the validation accuracy exceeds that of pixel-level supervision.

📝 Abstract
This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.
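As a rough illustration of the loss described above, the following NumPy sketch averages per-pixel class probabilities over one image and compares that average to a target size distribution. It assumes the direction KL(target || average prediction), which is the usual meaning of "zero-avoiding"; the function name `size_target_kl`, the `eps` smoothing constant, and the array shapes are illustrative choices and are not taken from the paper.

```python
import numpy as np

def size_target_kl(probs, target, eps=1e-8):
    """KL(target || mean prediction) for a single image.

    probs:  (H, W, K) per-pixel class probabilities (e.g. softmax outputs).
    target: (K,) approximate relative class sizes, summing to 1.
    eps:    hypothetical smoothing constant that keeps the logs finite.
    """
    probs = np.asarray(probs, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Average prediction: the fraction of the image assigned to each class.
    mean_pred = probs.reshape(-1, probs.shape[-1]).mean(axis=0)
    # Classes with a zero target contribute nothing (0 * log 0 is taken as 0).
    present = target > 0
    ratio = target[present] / (mean_pred[present] + eps)
    return float(np.sum(target[present] * np.log(ratio)))

# Toy example: a 2x2 image with K = 2 classes.
probs = np.array([[[0.9, 0.1], [0.8, 0.2]],
                  [[0.3, 0.7], [0.2, 0.8]]])
matching = size_target_kl(probs, target=[0.55, 0.45])    # target equals the mean prediction
mismatched = size_target_kl(probs, target=[0.1, 0.9])    # target disagrees with it
```

In this toy case the loss is close to zero when the target equals the average prediction and grows when they disagree. It becomes very large when the network assigns almost no pixels to a class whose target size is positive, which is the behaviour the term "zero-avoiding" refers to.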
Problem

Research questions and friction points this paper is trying to address.

Can approximate size targets, rather than pixel-precise masks, supervise accurate semantic segmentation?
Can a simple zero-avoiding KL-divergence loss replace the complex architectural modifications and multi-stage training used with class tags?
Are standard networks robust to errors in the size targets?
Innovation

Methods, ideas, or system contributions that make the work stand out.

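Extends binary class tags to approximate relative object-size distributions as image-level targets
Applies a zero-avoiding KL-divergence loss to the average predictions of off-the-shelf networks
Validates on PASCAL VOC with human-annotated size targets and on COCO and medical data with synthetically corrupted ones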
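To make the notion of a size target concrete, the sketch below derives exact relative class sizes from a label mask and then perturbs them. The abstract does not specify how the size targets were corrupted, so the multiplicative uniform noise used here, along with the names `relative_sizes`, `corrupt_sizes`, and `rel_noise`, is purely an assumption for illustration.

```python
import numpy as np

def relative_sizes(mask, num_classes):
    """Fraction of pixels belonging to each class in an integer label mask."""
    counts = np.bincount(np.asarray(mask).ravel(), minlength=num_classes)
    return counts / counts.sum()

def corrupt_sizes(sizes, rel_noise, rng):
    """Perturb a size distribution and renormalize it.

    Hypothetical noise model (not from the paper): each class size is scaled
    by an independent factor drawn uniformly from [1 - rel_noise, 1 + rel_noise],
    with 0 <= rel_noise < 1. Classes absent from the image stay at zero.
    """
    sizes = np.asarray(sizes, dtype=np.float64)
    factors = rng.uniform(1.0 - rel_noise, 1.0 + rel_noise, size=sizes.shape)
    noisy = sizes * factors
    return noisy / noisy.sum()

# Toy 2x3 mask with classes 0, 1, 2 present and class 3 absent.
mask = np.array([[0, 0, 1],
                 [0, 2, 2]])
exact = relative_sizes(mask, num_classes=4)
rng = np.random.default_rng(0)
noisy = corrupt_sizes(exact, rel_noise=0.2, rng=rng)
```

Either `exact` or `noisy` could then serve as the target distribution in a size-based loss. The renormalization keeps the corrupted target a valid distribution.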
Xingye Fan
University of Waterloo
Zhongwen Zhang
University of Waterloo
Yuri Boykov
Professor, Computer Science, University of Waterloo
computer vision, biomedical image analysis