🤖 AI Summary
Existing prompt learning methods lack causal theoretical foundations, which hinders the acquisition of causally invariant prompts that generalize across categories. To address this, we propose DiCap, a novel framework that pioneers the integration of diffusion models into counterfactual prompt generation. Grounded in causal identifiability theory, DiCap constructs minimally sufficient counterfactual samples and jointly optimizes marginal and conditional distribution gradients via contrastive learning, ensuring strict causal alignment in prompt learning. Theoretically, DiCap provides guaranteed bounds on estimation error; empirically, it significantly improves out-of-distribution generalization to unseen classes across image classification, image-text retrieval, and visual question answering tasks. Comprehensive experiments validate both the effectiveness and robustness of the learned causally invariant prompts.
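The summary above describes counterfactual generation as jointly following marginal and conditional distribution gradients. The page does not include an implementation, so the following is only a minimal sketch of one plausible reading, a Langevin-style reverse process that mixes marginal and conditional score estimates; every function name, signature, and the guidance scheme are assumptions for illustration, not DiCap's actual code:

```python
import torch

@torch.no_grad()
def sample_counterfactual(x, target_class, marginal_score, conditional_score,
                          num_steps=50, guidance=2.0, step_size=1e-2):
    """Hypothetical sketch: nudge an input toward a counterfactual class by
    mixing marginal and conditional score estimates during reverse diffusion.
    `marginal_score(x, t)` stands in for grad_x log p(x_t); `conditional_score
    (x, t, y)` for grad_x log p(x_t | y). Both networks are placeholders."""
    x_t = x.clone()
    for t in reversed(range(num_steps)):
        s_marg = marginal_score(x_t, t)                   # score of p(x_t)
        s_cond = conditional_score(x_t, t, target_class)  # score of p(x_t | y')
        # Guided update: follow the marginal score plus a weighted push along
        # the conditional direction (classifier-free-guidance style mixing).
        score = s_marg + guidance * (s_cond - s_marg)
        noise = torch.randn_like(x_t) if t > 0 else 0.0
        x_t = x_t + step_size * score + (2 * step_size) ** 0.5 * noise
    return x_t  # counterfactual sample steered toward class y'
```

Here the `guidance` weight interpolates between purely marginal dynamics and full conditioning on the counterfactual class, which is one common way to combine the two gradient sources; the paper's exact weighting may differ.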
📄 Abstract
Prompt learning has garnered attention for its efficiency over traditional model training and fine-tuning. However, existing methods, constrained by inadequate theoretical foundations, struggle to achieve causally invariant prompts and ultimately fall short of capturing robust features that generalize effectively across categories. To address these challenges, we introduce the $\textit{\textbf{DiCap}}$ model, a theoretically grounded $\textbf{Di}$ffusion-based $\textbf{C}$ounterf$\textbf{a}$ctual $\textbf{p}$rompt learning framework, which leverages a diffusion process to iteratively sample gradients from the marginal and conditional distributions of the causal model, guiding the generation of counterfactuals that satisfy the minimal sufficiency criterion. Grounded in rigorous theoretical derivations, this approach guarantees the identifiability of counterfactual outcomes while imposing strict bounds on estimation errors. We further employ a contrastive learning framework that leverages the generated counterfactuals, enabling the refined extraction of prompts precisely aligned with the causal features of the data. Extensive experimental results demonstrate that our method performs strongly across tasks such as image classification, image-text retrieval, and visual question answering, with particularly clear advantages on unseen categories.
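The abstract does not specify the contrastive objective, so the sketch below shows just one plausible form: an InfoNCE-style loss in which generated counterfactuals act as hard negatives for the learned prompt embeddings. All tensor names, shapes, and the choice of counterfactuals-as-negatives are hypothetical illustrations, not the paper's stated method:

```python
import torch
import torch.nn.functional as F

def counterfactual_contrastive_loss(prompt_feats, factual_feats, cf_feats,
                                    tau=0.07):
    """Hypothetical InfoNCE-style loss: pull prompt embeddings toward features
    of factual images and push them away from counterfactual features, so the
    prompts latch onto causal rather than spurious attributes."""
    p = F.normalize(prompt_feats, dim=-1)   # (B, D) prompt embeddings
    f = F.normalize(factual_feats, dim=-1)  # (B, D) factual image features
    c = F.normalize(cf_feats, dim=-1)       # (B, D) counterfactual features
    pos = (p * f).sum(-1) / tau             # positive similarities
    neg = (p * c).sum(-1) / tau             # counterfactual hard negatives
    logits = torch.stack([pos, neg], dim=1) # (B, 2): positive at index 0
    labels = torch.zeros(p.size(0), dtype=torch.long, device=p.device)
    return F.cross_entropy(logits, labels)
```

Because the counterfactuals differ from the factual inputs only in the intervened attributes, minimizing such a loss would encourage the prompts to encode the features that remain invariant, which matches the abstract's stated goal of causal alignment.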