MobileCLIP2: Improving Multi-Modal Reinforced Training

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimal zero-shot classification performance of lightweight multimodal models under low-latency constraints, this paper proposes MobileCLIP2, a refined multi-modal reinforced training framework. Methodologically, it introduces a multi-teacher CLIP ensemble trained on the DFN dataset, applies temperature tuning in contrastive knowledge distillation, fine-tunes caption-generator teachers on diverse high-quality image-caption datasets, and fuses synthetic captions generated by multiple models. Experimental results demonstrate that MobileCLIP2-B achieves a 2.2% accuracy gain over MobileCLIP-B on ImageNet-1k zero-shot classification. Moreover, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 at half the parameter count and improves on DFN ViT-L/14 at 2.5× lower latency, significantly improving the accuracy-efficiency trade-off for edge deployment.
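The temperature tuning in contrastive knowledge distillation mentioned above can be sketched as follows. This is a minimal one-direction (image-to-text) illustration, not the paper's implementation; the function names, default temperatures, and the NumPy setting are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_kd_loss(s_img, s_txt, t_img, t_txt,
                        tau_student=0.07, tau_teacher=0.7):
    """Mean KL divergence between teacher and student image-to-text
    similarity distributions over a batch (hypothetical defaults)."""
    norm = lambda z: z / np.linalg.norm(z, axis=-1, keepdims=True)
    # Cosine-similarity logits, each side scaled by its own temperature;
    # the teacher temperature is the knob being tuned in the ablations.
    s_logits = norm(s_img) @ norm(s_txt).T / tau_student
    t_logits = norm(t_img) @ norm(t_txt).T / tau_teacher
    p_t = softmax(t_logits)
    log_p_s = np.log(softmax(s_logits) + 1e-12)
    # Teacher rows act as soft targets for the student's rows
    return float((p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1).mean())
```

In a full CLIP-style setup the symmetric text-to-image direction would be added, and this distillation term would be combined with the standard contrastive loss on ground-truth pairs.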

📝 Abstract
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-modal training efficiency for lightweight CLIP models
Improving zero-shot accuracy with optimized teacher ensembles
Reducing model latency while maintaining state-of-the-art performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced CLIP teacher ensembles using DFN dataset
Improved captioner teachers via fine-tuning on diverse datasets
Combined synthetic captions from multiple models for improvement
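The teacher-ensemble and multi-model caption ideas above can be sketched as follows; all function names, the averaging scheme, and the sampling probability are illustrative assumptions rather than the released pipeline.

```python
import random
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_soft_targets(teacher_logits):
    """Average per-teacher similarity distributions into a single
    distillation target (one hypothetical way to ensemble CLIP teachers)."""
    return np.mean([softmax(l) for l in teacher_logits], axis=0)

def sample_caption(ground_truth, synthetic_by_model, p_synthetic=0.8, rng=random):
    """Mix the real caption with synthetic captions from several
    caption-generator models (hypothetical sampling scheme)."""
    if synthetic_by_model and rng.random() < p_synthetic:
        model = rng.choice(sorted(synthetic_by_model))
        return rng.choice(synthetic_by_model[model])
    return ground_truth
```

Averaging after the softmax keeps the ensemble target a valid probability distribution regardless of each teacher's logit scale, and drawing captions from multiple generators is one simple way to realize the "combined synthetic captions" ingredient.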