🤖 AI Summary
This study systematically investigates the limits of CLIP's generalization in two challenging settings: domain generalization (transfer to entirely unseen domains) and compositional generalization (transfer to unseen domain-class combinations). The authors construct training distributions with controlled domain diversity and class exposure to isolate the factors that drive generalization. Their empirical analysis, the first of its kind, reveals that domain diversity is a critical prerequisite for both forms of generalization; that compositional generalization is markedly weaker than domain generalization and more sensitive to the quality of the training subset drawn from the test domain; and that CLIP develops cross-modal shared representations as early as its intermediate layers, with shared circuit-level mechanisms supporting both cross-domain and cross-category transfer. In particular, suboptimal training distributions selectively impair compositional generalization, and the combination of intermediate-layer representation sharing with interpretable circuit mechanisms underpins robust generalization. These results provide theoretical insight and empirical evidence for designing robust vision-language models.
📝 Abstract
The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning shared representations as early as the intermediate layers, together with shared circuitry.
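To make the two evaluation settings concrete, the splits can be pictured as a domain × class grid. The sketch below is an illustrative assumption, not the paper's actual data: the domain and class names are hypothetical, and it only shows how held-out cells differ between the two settings (an entire domain held out for domain generalization vs. individual domain-class pairs held out for compositional generalization).

```python
from itertools import product

# Hypothetical domains and object classes (illustrative only).
domains = ["photo", "sketch", "painting", "cartoon"]
classes = ["dog", "cat", "car", "plane"]
all_cells = set(product(domains, classes))  # the full domain x class grid

# Domain generalization: an entire domain is absent from training.
held_out_domain = "cartoon"
dg_train = {(d, c) for d, c in all_cells if d != held_out_domain}
dg_test = all_cells - dg_train

# Compositional generalization: every domain and every class appears in
# training, but some specific (domain, class) pairs are held out.
held_out_pairs = {("sketch", "car"), ("painting", "plane")}
cg_train = all_cells - held_out_pairs
cg_test = held_out_pairs

# In both settings the test cells never occur during training.
assert dg_train.isdisjoint(dg_test) and cg_train.isdisjoint(cg_test)
# In the compositional split, each domain and each class is still covered,
# so only the *combination* is novel at test time.
assert {d for d, _ in cg_train} == set(domains)
assert {c for _, c in cg_train} == set(classes)
```

This makes the asymmetry described above easy to state: compositional generalization tests transfer to cells whose row and column were both seen, while domain generalization tests transfer to an entire unseen row.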