LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In unsupervised domain adaptive semantic segmentation, source-domain bias induces noisy pseudo-labels, while existing methods struggle to model complex inter-object spatial relationships. To address these challenges, this paper proposes a vision-language model (VLM)-based cross-modal alignment framework. Our method leverages VLMs to generate structured scene descriptions (e.g., “pedestrian on sidewalk”)—explicitly encoding spatial relations—and uses textual semantics as a bridge to align holistic visual features with scene-level meaning, overcoming limitations of mask- or prompt-driven approaches. We integrate cross-modal contrastive learning with unsupervised domain adaptation training. Evaluated on three DASS benchmarks, our method achieves state-of-the-art performance, improving mean Intersection-over-Union by 2.6%, 1.4%, and 3.9%, respectively. It significantly mitigates pseudo-label noise and reduces inter-domain discrepancies in spatial layout distributions.

📝 Abstract
Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, and the latter does not fully capture the intricate spatial relationships of objects, which are key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings"). Second, LangDA aligns whole-image features with the text representation of this context-aware scene caption, learning generalized representations via text. With this, LangDA sets a new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9% mIoU.
Problem

Research questions and friction points this paper is trying to address.

Improves domain adaptation for semantic segmentation
Addresses noisy pseudo-labels in vision-only methods
Enhances spatial object relationships via language context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VLM-generated scene descriptions for context
Aligns image features with text representations
Learns generalized representations via text
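The summary describes aligning whole-image features with the text embedding of a scene caption via cross-modal contrastive learning, but does not reproduce the exact loss. Assuming a CLIP-style symmetric InfoNCE objective (the function name, temperature default, and numpy implementation below are illustrative, not the paper's code), the alignment step could be sketched as:

```python
import numpy as np

def clip_style_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning image features with caption features.

    Row i of each array is assumed to be a matched (image, caption) pair;
    all other rows in the batch act as negatives. Hypothetical sketch, not
    LangDA's actual loss.
    """
    # L2-normalize so dot products become cosine similarities.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N); matched pairs on the diagonal

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = logits.shape[0]
    # Cross-entropy on the diagonal, in both directions (image→text, text→image).
    i2t = -log_softmax(logits, axis=1)[np.arange(n), np.arange(n)]
    t2i = -log_softmax(logits, axis=0)[np.arange(n), np.arange(n)]
    return float((i2t + t2i).mean() / 2)
```

In a LangDA-like setup, `text_feats` would come from encoding the VLM-generated scene description with a frozen text encoder, so that the loss pulls each image representation toward its context-aware caption rather than toward a single class-wise prompt.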