Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate pseudo-labeling in unsupervised domain adaptation (UDA) caused by target-domain visual embedding shifts in multimodal pre-trained models (e.g., CLIP), this paper proposes a structure-aware prompt learning framework. Methodologically, it introduces optimal transport theory into prompt learning for the first time, explicitly preserving the source-domain class-cluster geometry in text embeddings, and it designs a reference-prediction mechanism driven by the relation between source and target visual embeddings to improve pseudo-label reliability. The approach jointly integrates clustering regularization, self-training, and cross-modal alignment, and achieves state-of-the-art performance on standard UDA benchmarks. Ablation studies show that the proposed clustering constraint significantly improves pseudo-label accuracy (+4.2%) and cross-domain vision-language alignment quality, thereby strengthening the robustness of the learned prompt representations.
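The optimal-transport clustering constraint can be illustrated with a minimal Sinkhorn sketch. This is not the paper's implementation: the function names, the uniform marginals, and the cosine-distance cost are illustrative assumptions. The idea is to compute a balanced soft assignment of target visual embeddings to class prototypes (e.g., text embeddings), which can then serve as a clustering-aware training signal.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport (Sinkhorn-Knopp).

    cost: (n, k) cost matrix between n samples and k class prototypes.
    Returns a transport plan whose rows sum to 1/n and columns to 1/k,
    i.e. a balanced soft assignment of samples to clusters.
    """
    n, k = cost.shape
    K = np.exp(-cost / eps)                 # Gibbs kernel
    r = np.ones(n) / n                      # uniform sample marginal (assumption)
    c = np.ones(k) / k                      # uniform class marginal (assumption)
    v = np.ones(k) / k
    for _ in range(n_iters):
        u = r / (K @ v)                     # scale rows toward marginal r
        v = c / (K.T @ u)                   # scale columns toward marginal c
    return u[:, None] * K * v[None, :]      # transport plan, shape (n, k)

# Toy example: random unit-norm "visual" features and "text" prototypes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
protos = rng.normal(size=(3, 4))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

cost = 1.0 - feats @ protos.T               # cosine distance as OT cost
plan = sinkhorn(cost)
soft_labels = plan / plan.sum(axis=1, keepdims=True)  # per-sample cluster probabilities
```

The balanced marginals are what enforce the clustering behavior: every class prototype receives an equal share of transport mass, so the soft labels cannot collapse onto a single cluster the way raw argmax pseudo-labels can.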

📝 Abstract
Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning, by exploiting the geometry of visual and text embeddings, an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.
Problem

Research questions and friction points this paper is trying to address.

Preserve cluster structure in prompt learning for domain adaptation
Address deviation of visual embeddings in target domains
Enhance pseudo-labels using geometry of multimodal embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage source-target visual embedding relationships
Exploit clustering in visual-text embeddings
Enforce clustering via optimal transport theory
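The reference-prediction idea above can be sketched as a simple nearest-neighbor transfer in visual-embedding space. The k-NN aggregation, function name, and toy data here are assumptions for illustration, not the paper's actual mechanism: each target sample borrows the source-prompt class probabilities of its most similar source samples, yielding a reference prediction that could be blended with CLIP zero-shot pseudo-labels.

```python
import numpy as np

def reference_predictions(tgt_feats, src_feats, src_probs, k=3):
    """Average the source-prompt class probabilities of the k source
    samples most similar (by cosine similarity) to each target sample.

    tgt_feats: (n_tgt, d) target visual embeddings
    src_feats: (n_src, d) source visual embeddings
    src_probs: (n_src, n_cls) source-prompt class probabilities
    Returns (n_tgt, n_cls) reference predictions for the target domain.
    """
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    sim = tgt @ src.T                          # (n_tgt, n_src) cosine similarities
    nn = np.argsort(-sim, axis=1)[:, :k]       # indices of k most similar sources
    ref = src_probs[nn].mean(axis=1)           # average their class probabilities
    return ref / ref.sum(axis=1, keepdims=True)

# Toy example with random features and softmaxed source predictions.
rng = np.random.default_rng(1)
src_feats = rng.normal(size=(20, 4))
tgt_feats = rng.normal(size=(5, 4))
src_logits = rng.normal(size=(20, 3))
src_probs = np.exp(src_logits) / np.exp(src_logits).sum(axis=1, keepdims=True)

ref = reference_predictions(tgt_feats, src_feats, src_probs)
```

Because the reference comes from labeled-domain predictions rather than from the (possibly shifted) target-text similarity alone, it offers a second, geometry-grounded signal for validating or correcting zero-shot pseudo-labels.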