🤖 AI Summary
This work addresses the degradation of generalization performance when CLIP models are fine-tuned under out-of-distribution (OOD) conditions. Grounded in structural causal models (SCMs), it identifies a fundamental discrepancy between training and test environments: mechanisms built only on invariant causal factors remain consistent across environments, whereas spurious, varying factors induce distributional shift. The work provides the first theoretical proof that a linear mapping from CLIP's image/text embeddings to the invariant causal factors can be estimated, and derives a sufficient condition for low OOD risk. Building on this, it proposes CLIP-ICM, the first invariant-causal-mechanism learning framework for vision-language models, which disentangles mechanisms via intervention-based data construction, linear projection estimation, and prediction within the invariant subspace. Extensive evaluation on multiple standard OOD benchmarks demonstrates significant gains in generalization, validating the effectiveness of CLIP-ICM in enhancing model robustness and reliability.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications.
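The three steps of CLIP-ICM described above (collect interventional data, estimate a linear projection, predict within the invariant subspace) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the synthetic "embeddings", the spurious-direction matrix `B`, and the SVD-based estimation of the varying subspace are all illustrative assumptions, standing in for real CLIP embeddings of interventional pairs (e.g., the same content under different styles).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # toy embedding dimension
k = 2   # dimension of the spurious (varying) subspace

# Toy assumption: each embedding = invariant part + spurious part,
# where spurious variation lives in the column space of B.
B = np.linalg.qr(rng.normal(size=(d, k)))[0]

def embed(invariant, spurious):
    """Hypothetical embedding: invariant content plus spurious shift."""
    return invariant + B @ spurious

# Step 1: "interventional" pairs share invariant content but differ
# in spurious factors (e.g., the same object in two visual styles).
n = 50
Z = rng.normal(size=(n, d))  # invariant content per sample
pairs_a = np.stack([embed(z, rng.normal(size=k)) for z in Z])
pairs_b = np.stack([embed(z, rng.normal(size=k)) for z in Z])

# Step 2: paired differences cancel the invariant part, so the top
# singular directions of the differences span the varying subspace;
# project onto its orthogonal complement (the invariant subspace).
D = pairs_a - pairs_b
U, _, _ = np.linalg.svd(D.T, full_matrices=False)
V = U[:, :k]                 # estimated spurious directions
P = np.eye(d) - V @ V.T      # linear projector onto invariant subspace

# Step 3: predict using projected embeddings. After projection, the two
# views of the same content agree, regardless of spurious factors.
resid = np.linalg.norm(P @ D.T)
```

In this toy setting `resid` is near zero because `P` annihilates the estimated spurious directions, mirroring the paper's claim that prediction restricted to the invariant subspace is stable across environments.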