MetaCLIP 2: A Worldwide Scaling Recipe

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the "curse of multilinguality": the degradation of English performance that is common in multilingual CLIP models. The authors propose a general, translation-free, architecture-agnostic multilingual contrastive training recipe. Leveraging web-scale multilingual image-text pairs, combined with large-scale data curation and cross-lingual balancing, they train a multilingual CLIP model from scratch over a Vision Transformer backbone. The resulting model surpasses its English-only counterpart by 0.8% on zero-shot ImageNet classification while also setting new state-of-the-art results on multilingual benchmarks (57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% image-to-text retrieval on XM3600), marking the first time a multilingual CLIP comprehensively outperforms its English-only counterpart. This substantively breaks the multilingual curse, demonstrating that multilingual capability need not compromise English performance.

📝 Abstract
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as a vision encoder for multimodal large language models (MLLMs). Although CLIP has been successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges, and present a recipe that enables mutual benefits between English and non-English world data. On zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and, surprisingly, sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks such as CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3% on image-to-text retrieval).
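The contrastive objective at the core of CLIP-style training (and of MetaCLIP 2, which keeps it unchanged) can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal numpy illustration of the standard formulation, not the paper's implementation; the function name and temperature value are illustrative.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]
    targets = np.arange(n)              # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), targets].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Perfectly matched embeddings (identical image and text vectors) yield a near-zero loss, while mismatched random pairs yield a loss near log(batch_size).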
Problem

Research questions and friction points this paper is trying to address.

Scaling CLIP training for worldwide web data
Addressing the curse of multilinguality in CLIP
Improving multilingual performance without system-level changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains CLIP on worldwide web-scale image-text pairs
Improves multilingual performance without English degradation
Achieves state-of-the-art in multilingual benchmarks
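The cross-lingual balancing mentioned above can be illustrated with the standard temperature-resampling heuristic used in multilingual training, where raw per-language counts are flattened so low-resource languages are upsampled. This is a generic sketch under that assumption, not MetaCLIP 2's exact curation procedure; the function name and alpha value are hypothetical.

```python
import numpy as np

def balanced_language_weights(counts, alpha=0.7):
    """Per-language sampling probabilities p_l proportional to (n_l / N)^alpha.

    alpha=1 keeps the natural data distribution; alpha=0 samples languages
    uniformly; values in between upsample low-resource languages relative
    to their raw share of the corpus.
    """
    counts = np.asarray(counts, dtype=float)
    natural = counts / counts.sum()     # raw corpus shares
    reweighted = natural ** alpha       # flatten the distribution
    return reweighted / reweighted.sum()
```

For example, a language with 1% of the corpus receives a noticeably larger sampling probability at alpha=0.5, while the weights still sum to one.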