🤖 AI Summary
Existing vision-language models generalize poorly in zero-shot tasks, particularly on unseen classes, because of two key limitations: (1) biases introduced by image augmentations are not explicitly modeled, so prompts overfit to augmentation artifacts; and (2) without semantic guidance, prompts fail to focus on intrinsic visual semantics. To address these limitations, we propose the first prompt learning framework that *decouples augmentation bias*. It combines causal intervention with contrastive learning to explicitly separate semantic features from augmentation-related ones, and it incorporates a learnable prompt network. Our method requires no additional annotations and achieves significant performance gains across multiple zero-shot and cross-domain benchmarks, with the strongest generalization improvements in low-data and out-of-distribution settings. This work establishes a new paradigm for robust, semantics-aware prompt learning in vision-language models.
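
The summary does not specify the loss functions used, so the following is only a generic illustration of the contrastive-learning ingredient, not the authors' implementation. It sketches a standard InfoNCE-style loss in NumPy that pulls together the features of two views of the same image and pushes apart features of different images. The function names (`l2_normalize`, `info_nce`), the temperature value, and the random toy features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss: row i of view_a is the positive for row i of view_b.

    This is a textbook contrastive loss, used here as a stand-in; the paper's
    actual objective is not given in the summary.
    """
    a = l2_normalize(view_a)
    b = l2_normalize(view_b)
    logits = a @ b.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on diagonal


# Toy data (hypothetical): "semantic" features and a lightly perturbed copy
# standing in for the same images under a different augmentation.
rng = np.random.default_rng(0)
semantic = rng.normal(size=(8, 16))
aligned = semantic + 0.01 * rng.normal(size=(8, 16))
shuffled = aligned[::-1].copy()  # break the pairing between the two views

loss_aligned = info_nce(semantic, aligned)
loss_shuffled = info_nce(semantic, shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

When the two views share the same underlying features, the loss is low; when the pairing is broken, it is much higher. A framework of the kind described would presumably apply such an objective so that the semantic branch stays consistent across augmentations, but how the augmentation branch and the causal intervention enter the objective is not stated in the summary.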