Dual-Foundation Models for Unsupervised Domain Adaptation

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of unsupervised domain-adaptive semantic segmentation: high-confidence pseudo-labels cover only part of the target domain, and class prototypes initialized from source-trained models are biased and unstable. The authors propose a dual-foundation framework built on SAM and DINOv3. Superpixel-guided prompting of SAM extends learning to target-domain pixels beyond high-confidence predictions, while DINOv3's robust features are used to construct stable, domain-invariant class prototypes, reducing reliance on source-initialized prototypes and high-confidence pseudo-labels. The method reports mIoU improvements of +1.3% on GTA→Cityscapes and +1.4% on SYNTHIA→Cityscapes over strong UDA baselines.
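For readers unfamiliar with the metric, mIoU (mean Intersection-over-Union) averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels. The sketch below is a generic NumPy illustration, not the paper's evaluation code; the function name `mean_iou` and the `ignore_index=255` convention are assumptions made for this example.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: integer arrays of the same shape holding class ids.
    ignore_index: label value excluded from evaluation (assumed convention).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop pixels that carry the ignore label.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Average only over classes that occur at least once.
    seen = union > 0
    return float((intersection[seen] / union[seen]).mean())

# Toy example: three valid pixels, one ignored pixel.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 255])
score = mean_iou(pred, target, num_classes=2)
```

In the toy example each class has one correct pixel and a union of two pixels, so both per-class IoUs are 0.5.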

📝 Abstract

Semantic segmentation provides pixel-level scene understanding essential for autonomous driving and fine-grained perception tasks. However, training segmentation models requires costly, labor-intensive annotations on real-world datasets. Unsupervised Domain Adaptation (UDA) addresses this by training models on labeled synthetic data and adapting them to unlabeled real images. While conceptually simple, adaptation is challenging due to the domain gap, i.e., differences in visual appearance and scene structure between synthetic and real data. Prior approaches bridge this gap through pixel-level mixing or feature-level contrastive learning. Yet, these techniques suffer from two major limitations: (1) reliance on high-confidence pseudo-labels restricts learning to a subset of the target domain, and (2) prototype-based contrastive methods initialize class prototypes from source-trained models, yielding biased and unstable anchors during adaptation. To address these issues, we propose a dual-foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel-guided prompting to enable learning from a broader range of target pixels beyond high-confidence predictions. Second, we incorporate DINOv3 to construct stable, domain-invariant class prototypes through its robust representation learning. Our method achieves consistent improvements of +1.3% and +1.4% mIoU over strong UDA baselines on GTA-to-Cityscapes and SYNTHIA-to-Cityscapes, respectively.
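To make the notion of a class prototype concrete, the sketch below computes one prototype per class as the normalized mean of per-pixel feature vectors and assigns new pixels to the most similar prototype by cosine similarity. This is a generic illustration of prototype construction, not the authors' implementation: the feature arrays stand in for embeddings from a frozen encoder such as DINOv3, and the names `class_prototypes` and `assign_to_prototypes` are hypothetical.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Return (prototypes, present) from per-pixel features.

    features: (N, D) array of embeddings (stand-in for encoder features).
    labels:   (N,) integer class ids for those pixels.
    present:  boolean mask of classes that had at least one pixel.
    """
    dim = features.shape[1]
    prototypes = np.zeros((num_classes, dim), dtype=np.float64)
    present = np.zeros(num_classes, dtype=bool)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            mean = features[mask].mean(axis=0)
            # L2-normalize so that dot products become cosine similarities.
            prototypes[c] = mean / (np.linalg.norm(mean) + 1e-12)
            present[c] = True
    return prototypes, present

def assign_to_prototypes(features, prototypes, present):
    """Assign each feature vector to the most similar present prototype."""
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-12
    sims = (features / norms) @ prototypes.T
    sims[:, ~present] = -np.inf  # never pick a class without a prototype
    return sims.argmax(axis=1)

# Toy data: 2-D features for two classes; class 2 has no pixels.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
protos, present = class_prototypes(feats, labels, num_classes=3)

queries = np.array([[0.8, 0.2], [0.2, 0.8]])
assigned = assign_to_prototypes(queries, protos, present)
```

Because the prototypes here depend only on the feature extractor and the labels supplied, the quality of the anchors follows directly from the quality of the features, which is the motivation the abstract gives for using a robust foundation-model encoder rather than a source-trained model.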

Problem

Research questions and friction points this paper is trying to address.

Unsupervised Domain Adaptation

Semantic Segmentation

Domain Gap

Pseudo-labels

Class Prototypes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Foundation Models

Unsupervised Domain Adaptation

Segment Anything Model (SAM)