🤖 AI Summary
This work systematically investigates the practical utility of unsupervised domain adaptation (UDA) for semantic segmentation in the era of vision foundation models (VFMs), covering both synth-to-real and real-to-real scenarios. It quantifies UDA gains on more representative and diverse benchmarks, marking the first such evaluation. Experiments reveal that UDA's improvement shrinks sharply with strong synthetic source data (from +8 to +2 mIoU) and vanishes entirely with real source data. In contrast, incorporating only 1/16 of the labeled target samples (e.g., Cityscapes) enables UDA to reach 85 mIoU, matching state-of-the-art fully supervised performance. Methodologically, the approach combines source-side fine-tuning, cross-domain consistency regularization, pseudo-label refinement, and sparse supervision. The core contribution lies in rigorously delineating the applicability boundaries of UDA under the VFM paradigm, providing practical guidance for deployment decisions in large-scale applications such as autonomous driving.
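The pseudo-label refinement mentioned above refers to the standard self-training recipe in UDA segmentation: the model's confident predictions on unlabeled target images are reused as training targets, while uncertain pixels are ignored. A minimal sketch under assumed conventions (the function name, the 0.9 confidence threshold, and the 255 ignore index are illustrative, not taken from the paper):

```python
import numpy as np

def pseudo_labels(probs, threshold=0.9):
    """Turn per-pixel softmax probabilities (H, W, C) into hard pseudo-labels,
    keeping only pixels whose max class probability exceeds `threshold`;
    all other pixels get 255, a conventional ignore index for the loss."""
    labels = probs.argmax(axis=-1)        # (H, W) hard class predictions
    confidence = probs.max(axis=-1)       # (H, W) max class probability
    labels[confidence < threshold] = 255  # drop low-confidence pixels
    return labels

# Toy 1x2 "image" with 3 classes: one confident pixel, one uncertain one.
probs = np.array([[[0.98, 0.01, 0.01],
                   [0.40, 0.35, 0.25]]])
print(pseudo_labels(probs))  # [[  0 255]]
```

Only the confident pixel contributes to the target-domain loss; the ignored pixels are revisited in later epochs as the model improves.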
📝 Abstract
Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs generalize strongly from their pre-training, simpler source-only fine-tuning can also perform well on the target. As the data scenarios used in academic research are not necessarily representative of real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) whether source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, like previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different combinations of source and target data. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, with stronger synthetic source data, UDA's improvement over source-only fine-tuning of VFMs drops from +8 mIoU to +2 mIoU, and with more diverse real source data, UDA adds no value. However, UDA generalization remains higher than source-only fine-tuning in all synthetic-data scenarios, and when including only 1/16 of the Cityscapes labels, synthetic UDA reaches the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering these mixed results, we discuss how UDA can best support robust autonomous driving at scale.
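All results above are reported in mIoU (mean Intersection-over-Union), the standard semantic-segmentation metric: per-class IoU is computed from a confusion matrix and averaged over the classes present. A minimal sketch of the computation (function and parameter names are illustrative, not from the paper's codebase):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection-over-Union in percent, from flat label arrays.
    Pixels labeled `ignore_index` in the ground truth are excluded."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows = ground truth class, cols = predicted class.
    cm = np.bincount(num_classes * target + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)              # true positives per class
    union = cm.sum(0) + cm.sum(1) - inter          # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)             # avoid division by zero
    present = union > 0                            # only classes that occur
    return 100.0 * iou[present].mean()

# Toy example: 4 pixels, 2 classes.
pred   = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
print(round(mean_iou(pred, target, num_classes=2), 1))  # 58.3
```

On real benchmarks such as Cityscapes, the confusion matrix is accumulated over all validation images before the final averaging, so rare classes are not skewed by per-image computation.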