🤖 AI Summary
This work addresses the English performance degradation that is prevalent in multilingual CLIP models, known as the “curse of multilinguality”. We propose a general multilingual contrastive training recipe that requires neither translation nor bespoke architecture changes. Leveraging worldwide web-scale image-text pairs, we train MetaCLIP 2, a multilingual CLIP model, from scratch, integrating large-scale data curation and cross-lingual balancing strategies with a Vision Transformer backbone. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%. On multilingual benchmarks it reaches 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval. These results indicate that the curse of multilinguality can be broken: multilingual capability need not compromise English performance.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting uses that range from zero-shot classification and retrieval to serving as the encoder for multimodal large language models (MLLMs). Although CLIP has been trained successfully on billion-scale image-text pairs from the English world, scaling its training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges, and present a recipe that enables mutual benefits between English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art on multilingual benchmarks without system-level confounding factors (e.g., translation, bespoke architecture changes): 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval.