🤖 AI Summary
In unsupervised domain adaptive semantic segmentation, source-domain bias induces noisy pseudo-labels, while existing methods struggle to model complex inter-object spatial relationships. To address these challenges, this paper proposes a vision-language model (VLM)-based cross-modal alignment framework. Our method leverages VLMs to generate structured scene descriptions (e.g., “pedestrian on sidewalk”)—explicitly encoding spatial relations—and uses textual semantics as a bridge to align holistic visual features with scene-level meaning, overcoming limitations of mask- or prompt-driven approaches. We integrate cross-modal contrastive learning with unsupervised domain adaptation training. Evaluated on three DASS benchmarks, our method achieves state-of-the-art performance, improving mean Intersection-over-Union by 2.6%, 1.4%, and 3.9%, respectively. It significantly mitigates pseudo-label noise and reduces inter-domain discrepancies in spatial layout distributions.
📝 Abstract
Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic, class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, while the latter does not fully capture the intricate spatial relationships among objects -- key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns entire-image features with the text representation of this context-aware scene caption, learning generalized representations via text. With this, LangDA sets a new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.
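The scene-level alignment described above is a form of cross-modal contrastive learning: the embedding of a whole image is pulled toward the embedding of its VLM-generated caption and pushed away from the captions of other images in the batch. The sketch below is a minimal, dependency-free illustration of such a symmetric image-text InfoNCE objective, not LangDA's actual loss; the feature extractors, the caption encoder, and the temperature value are all assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (image, caption) embedding pairs.

    Each image embedding should be most similar to its own caption embedding
    among all captions in the batch, and vice versa. `temperature` (an assumed
    value here) sharpens the similarity distribution before cross-entropy.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    # Cosine-similarity logits between every image and every caption.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable -log softmax(row)[target].
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Image-to-text and text-to-image directions, averaged.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2
```

In LangDA the "positive" text is the context-aware scene caption of the entire image, so the loss encourages holistic image features (rather than per-mask or per-class features) to match scene-level semantics.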