CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt-based CLIP methods suffer from two key limitations: inaccurate textual descriptions and insufficient vision-language modality alignment, which hinder zero-shot transfer and out-of-distribution (OOD) generalization. To address these issues, the paper proposes Conditional Domain prompt Learning (CoDoL). CoDoL incorporates domain priors to construct conditional text prompts and introduces a lightweight Domain Meta Network that dynamically generates instance- and domain-aware prompt tokens, fully preserving the frozen pre-trained CLIP backbone. By enhancing cross-modal alignment without fine-tuning the visual encoder, CoDoL significantly improves robustness to distribution shifts. Evaluated on four standard OOD benchmarks (PACS, VLCS, OfficeHome, and DigitDG), CoDoL consistently outperforms state-of-the-art prompt learning and domain generalization methods, demonstrating superior effectiveness and generalization across diverse domain-shift scenarios.

📝 Abstract
Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive performance, the prompt-based CLIP methods still suffer from: i) inaccurate text descriptions, which leads to degraded accuracy and robustness, and poses a challenge for zero-shot CLIP methods. ii) limited vision-language embedding alignment, which significantly affects the generalization performance. To tackle the above issues, this paper proposes a novel Conditional Domain prompt Learning (CoDoL) method, which utilizes readily-available domain information to form prompts and improves the vision-language embedding alignment for improving OOD generalization. To capture both instance-specific and domain-specific information, we further propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain. Extensive experiments on four OOD benchmarks (PACS, VLCS, OfficeHome and DigitDG) validate the effectiveness of our proposed CoDoL in terms of improving the vision-language embedding alignment as well as the out-of-distribution generalization performance.
Problem

Research questions and friction points this paper is trying to address.

Inaccurate text descriptions degrade the accuracy and robustness of zero-shot CLIP methods
Limited vision-language embedding alignment hurts out-of-distribution generalization
How to exploit readily available domain information when constructing prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional domain prompts built from readily available domain information
Lightweight Domain Meta Network (DMN) that generates input-conditional tokens per domain
Improved vision-language embedding alignment for stronger OOD generalization
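The abstract's key mechanism, learnable domain-specific context vectors shifted by an instance-conditional token from a lightweight meta network, prepended to a frozen class embedding, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; all dimensions, the single-linear-layer meta network, and the function names are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of CoDoL-style conditional domain prompting (NOT the
# authors' code): learnable per-domain context vectors plus a lightweight
# "Domain Meta Network" that maps a frozen CLIP image feature to an
# input-conditional token added to every context vector. The CLIP
# backbone itself would stay frozen; only domain_ctx and W_meta train.

rng = np.random.default_rng(0)
EMBED_DIM, N_CTX, N_DOMAINS = 8, 4, 3  # toy sizes, not the paper's

# Learnable context vectors, one set per source domain (domain prior).
domain_ctx = rng.normal(size=(N_DOMAINS, N_CTX, EMBED_DIM)) * 0.02

# Lightweight meta network, reduced here to one linear projection.
W_meta = rng.normal(size=(EMBED_DIM, EMBED_DIM)) * 0.02

def conditional_prompt(image_feat, domain_id, class_embed):
    """Build a prompt: [domain context + instance token] + class token."""
    pi = image_feat @ W_meta          # instance-conditional token, (EMBED_DIM,)
    ctx = domain_ctx[domain_id] + pi  # broadcast shift over all N_CTX tokens
    return np.concatenate([ctx, class_embed[None]], axis=0)

# Toy usage with random stand-ins for frozen CLIP features.
image_feat = rng.normal(size=EMBED_DIM)
class_embed = rng.normal(size=EMBED_DIM)
prompt = conditional_prompt(image_feat, domain_id=1, class_embed=class_embed)
print(prompt.shape)  # (N_CTX + 1, EMBED_DIM)
```

In the full method the resulting prompt sequence would be fed through CLIP's frozen text encoder and matched against image features with the usual contrastive score; the sketch stops at prompt construction, which is the part the Innovation bullets describe.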
👥 Authors
Min Zhang (East China Normal University)
Bo Jiang (East China Normal University)
Jie Zhou (East China Normal University)
Yimeng Liu (University of California, Santa Barbara; Human-Computer Interaction, Human-AI Interaction, Human-Centered AI)
Xin Lin (East China Normal University)