🤖 AI Summary
Existing unsupervised domain adaptation (UDA) methods perform well under classical domain shifts (e.g., synthetic → real) but suffer significant performance degradation under complex shifts—such as geographical shifts—due to drastic variations in background and object appearance. To address this, we propose a text-modality-robust vision-language collaborative UDA framework. Our method quantifies pseudo-label uncertainty via normalized CLIP similarity and introduces a multimodal soft contrastive learning loss, enabling dynamic weighting of classification loss and implicit modeling of positive/negative sample pairs. Furthermore, we jointly optimize pseudo-label generation, uncertainty estimation, soft contrastive learning, and cross-modal (vision-text) feature alignment. Extensive experiments demonstrate state-of-the-art performance on both DomainNet (classical shifts) and GeoNet (geographical shifts), significantly enhancing cross-domain generalization capability.
📝 Abstract
Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shifts), where both the background and object appearances differ significantly across domains. Prior work has shown that the language modality can help in the adaptation process, as it exhibits more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. This estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of incorrect pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repelled with a strength proportional to the similarity of their captions. This solution avoids the need for hard assignment of positive and negative pairs, which is difficult in the UDA setting. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
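The two ingredients described above can be illustrated with a minimal NumPy sketch. Note that this is our reading of the abstract, not the paper's implementation: the function names, the use of a row-softmax to normalise similarities, and the cross-entropy form of the soft contrastive objective are all assumptions for illustration.

```python
import numpy as np

def _row_softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_label_confidence(clip_sims):
    """Softmax-normalise per-sample CLIP image-text similarity scores over
    classes and take the max probability as a confidence weight in (0, 1].
    (Hypothetical instantiation of the uncertainty estimate; the paper's
    exact normalisation may differ.) Low-confidence pseudo-labels then get
    a small weight on the classification loss."""
    return _row_softmax(clip_sims).max(axis=-1)

def soft_contrastive_loss(img_feats, cap_feats, tau=0.1):
    """Caption-guided soft contrastive loss (sketch): every image pair is
    simultaneously positive and negative. The row-softmax of caption-caption
    similarities defines soft target weights w, and the image-image
    similarity distribution p is pulled toward w via cross-entropy."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    zi, zc = l2norm(img_feats), l2norm(cap_feats)
    logits_cap = zc @ zc.T / tau   # caption similarities -> soft targets
    logits_img = zi @ zi.T / tau   # image similarities -> predictions
    np.fill_diagonal(logits_cap, -np.inf)  # exclude self-pairs
    np.fill_diagonal(logits_img, -np.inf)
    w = _row_softmax(logits_cap)
    p = _row_softmax(logits_img)
    # Cross-entropy between the two pairwise distributions.
    return -(w * np.log(p + 1e-12)).sum(axis=1).mean()
```

In this sketch, pairs whose captions are similar receive a large target weight `w[i, j]`, so their image features are attracted; dissimilar-caption pairs get a weight near zero and are effectively repelled by the softmax normalisation, without ever hard-labelling any pair as positive or negative.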