🤖 AI Summary
Existing semi-supervised semantic segmentation methods achieve limited performance gains on complex datasets (e.g., ADE20K, COCO) and fail to effectively leverage the rich representations of large-scale pretrained vision transformers (ViTs), such as DINOv2. To address this, we systematically introduce ViT backbones—replacing ResNet—into semi-supervised segmentation benchmarks for the first time. We propose a lightweight weak-strong consistency regularization framework that reduces training overhead, and integrate self-training with robust pseudo-label optimization to improve unlabeled data utilization. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods on both ADE20K and COCO, achieving higher accuracy with fewer parameters. This establishes a new paradigm for semi-supervised segmentation that is computationally efficient, generalizable across diverse domains, and scalable to larger models and datasets.
📝 Abstract
Semi-supervised semantic segmentation (SSS) aims to learn rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch improved substantially over its predecessors by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite this progress, and strangely so in an era of numerous powerful vision models, almost all SSS works still stick to 1) outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluation on the simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the SSS baseline from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) pre-trained on massive data. A simple update of the encoder (even with 2x fewer parameters) can bring larger gains than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, which inherits the core spirit of weak-to-strong consistency from V1 but requires less training cost and delivers consistently better results. Additionally, given the gradually saturating performance on Pascal and Cityscapes, we urge the community to focus on more challenging benchmarks with complex taxonomies, such as the ADE20K and COCO datasets. Code, models, and logs of all reported values are available at https://github.com/LiheYoung/UniMatch-V2.
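To make the core idea concrete, here is a minimal sketch of the weak-to-strong consistency principle the abstract refers to: pseudo-labels are taken from predictions on a weakly augmented view, and only confident pixels supervise the prediction on a strongly augmented view. This is a generic FixMatch-style illustration in NumPy, not the exact UniMatch V2 loss; the function name, shapes, and the 0.95 confidence threshold are illustrative assumptions.

```python
import numpy as np

def weak_to_strong_loss(logits_weak, logits_strong, conf_thresh=0.95):
    """Confidence-masked cross-entropy between a strong view's prediction
    and pseudo-labels from the weak view (generic sketch, not the paper's
    exact formulation). Both inputs have shape (B, C, H, W)."""
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    probs_w = softmax(logits_weak)
    conf = probs_w.max(axis=1)        # per-pixel confidence, (B, H, W)
    pseudo = probs_w.argmax(axis=1)   # per-pixel pseudo-label, (B, H, W)
    mask = conf >= conf_thresh        # keep only confident pixels

    log_probs_s = np.log(softmax(logits_strong) + 1e-12)
    # Gather the strong view's log-prob of each pixel's pseudo class.
    b, h, w = pseudo.shape
    bi, hi, wi = np.meshgrid(np.arange(b), np.arange(h), np.arange(w),
                             indexing="ij")
    ce = -log_probs_s[bi, pseudo, hi, wi]
    # Average the loss over confident pixels only.
    return (ce * mask).sum() / max(mask.sum(), 1)
```

In practice the weak view is typically a crop/flip and the strong view adds color jitter, CutMix, or feature perturbation; the threshold trades pseudo-label coverage against noise.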