🤖 AI Summary
In unsupervised domain adaptive semantic segmentation, source-domain bias induces noisy pseudo-labels, while existing methods struggle to model complex inter-object spatial relationships. To address these challenges, this paper proposes a vision-language model (VLM)-based cross-modal alignment framework. Our method leverages VLMs to generate structured scene descriptions (e.g., “pedestrian on sidewalk”)—explicitly encoding spatial relations—and uses textual semantics as a bridge to align holistic visual features with scene-level meaning, overcoming limitations of mask- or prompt-driven approaches. We integrate cross-modal contrastive learning with unsupervised domain adaptation training. Evaluated on three DASS benchmarks, our method achieves state-of-the-art performance, improving mean Intersection-over-Union by 2.6%, 1.4%, and 3.9%, respectively. It significantly mitigates pseudo-label noise and reduces inter-domain discrepancies in spatial layout distributions.
📝 Abstract
Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic, class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, while the latter does not fully capture the intricate spatial relationships among objects -- key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns entire-image features with the text representation of this context-aware scene caption, learning generalized representations via text. With this, LangDA sets a new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.
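The scene-level alignment described above is a form of cross-modal contrastive learning: the embedding of a whole image is pulled toward the embedding of its VLM-generated caption and pushed away from the captions of other images in the batch. The sketch below is a minimal, dependency-free illustration of such a symmetric image-text InfoNCE objective, not LangDA's actual loss; the feature extractors, the caption encoder, and the temperature value are all assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (image, caption) embedding pairs.

    Each image embedding should be most similar to its own caption embedding
    among all captions in the batch, and vice versa. `temperature` (an assumed
    value here) sharpens the similarity distribution before cross-entropy.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    # Cosine-similarity logits between every image and every caption.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable -log softmax(row)[target].
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Image-to-text and text-to-image directions, averaged.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2
```

In LangDA the "positive" text is the context-aware scene caption of the entire image, so the loss encourages holistic image features (rather than per-mask or per-class features) to match scene-level semantics.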