Unified modality separation: A vision-language framework for unsupervised domain adaptation

📅 2025-08-06

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

In unsupervised domain adaptation (UDA), vision-language models (VLMs) suffer from limited transfer of modality-invariant knowledge and degraded target-domain performance due to inherent modality gaps. To address this, we propose a Unified Modality Separation (UMS) framework: (1) we introduce, for the first time, a modality discrepancy measurement mechanism to dynamically decouple modality-specific and modality-invariant features; (2) we employ uncertainty-guided sample selection for pseudo-labeling enhancement and design a test-time adaptive feature fusion strategy; and (3) we integrate prompt tuning, embedding alignment, feature disentanglement, and modality-adaptive weight learning. Evaluated across multiple backbones, benchmarks, and UDA settings, UMS achieves an average accuracy gain of 9%, reduces computational overhead by 9×, and significantly improves cross-domain generalization and inference efficiency.

Technology Category

Application Category

📝 Abstract

Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as modality gap. Our findings reveal that direct UDA with the presence of modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to 9% performance gain with 9 times of computational efficiencies. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.

Problem

Research questions and friction points this paper is trying to address.

Addresses modality gap in vision-language UDA frameworks

Proposes unified separation of modality-specific and invariant components

Enhances cross-modal alignment with adaptive ensemble weights

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified modality separation framework

Modality-adaptive ensemble weights

Modality discrepancy metric design

🔎 Similar Papers

No similar papers found.