CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

📅 2024-09-28
🏛️ arXiv.org
📈 Citations: 8
Influential citations: 0

🤖 AI Summary
CLIP loses substantial fine-grained visual information during encoding and tends to capture only coarse-grained features. To address this, the authors propose Diversified Multiplet Upcycling (DMU), a model-agnostic strategy that is, to the best of their knowledge, the first to introduce sparsely activated Mixture-of-Experts (MoE) into CLIP. Starting from a single dense pre-trained CLIP checkpoint, DMU fine-tunes a series of CLIP models that capture different feature spaces and share all parameters except their Feed-Forward Networks (FFNs); these models are then combined into a dynamically routed CLIP-MoE with larger capacity. The resulting CLIP-MoE can replace any dense CLIP checkpoint in a plug-and-play manner, with no further adaptation of downstream frameworks. Evaluated on zero-shot image–text retrieval, zero-shot image classification, and MLLM benchmarks (as a vision encoder), CLIP-MoE reportedly improves performance by 3.2–7.8% while increasing computational overhead by less than 5%.

📝 Abstract
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models that capture different feature spaces, from a dense pre-trained CLIP checkpoint, sharing parameters except for the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the significant performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks by serving as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.
Problem

Research questions and friction points this paper is trying to address.

Enhance CLIP's feature diversity to reduce information loss
Develop cost-effective method for creating diverse CLIP models
Optimize model capacity and computational cost via CLIP-MoE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversified Multiplet Upcycling for CLIP enhancement
Multistage contrastive learning for cost-effective fine-tuning
Dynamic expert activation in CLIP-MoE for efficiency
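
To make the idea of sparsely activated experts concrete, here is a minimal sketch of a top-k routed FFN layer in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`ffn`, `moe_ffn`), the softmax gating over the selected experts, the choice of `top_k=2`, and the random weights are all hypothetical. In DMU the expert FFNs would come from the fine-tuned CLIP models, while attention and other parameters stay shared.

```python
import math
import random


def ffn(x, w1, w2):
    """Two-layer feed-forward network with ReLU: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_ffn(x, experts, router, top_k=2):
    """Send token x to its top_k experts and mix their outputs by gate weight.

    experts: list of (w1, w2) weight pairs, one per expert FFN.
    router:  one weight row per expert; row @ x is that expert's routing logit.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalise over chosen experts only
    out = [0.0] * len(x)  # assumes each expert maps dim -> dim
    for gate, idx in zip(gates, chosen):
        w1, w2 = experts[idx]
        y = ffn(x, w1, w2)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim, num_experts = 4, 8, 4
experts = [
    (random_matrix(hidden_dim, dim, rng), random_matrix(dim, hidden_dim, rng))
    for _ in range(num_experts)
]
router = random_matrix(num_experts, dim, rng)
token = [0.5, -1.0, 0.25, 2.0]

output, chosen = moe_ffn(token, experts, router, top_k=2)
print("experts used:", chosen)
print("output:", output)
```

Because only `top_k` of the `num_experts` FFNs run per token, total capacity grows with the number of experts while per-token compute stays close to that of a dense layer. A side effect of this gating is that if every expert were an identical copy of the dense FFN, the layer would reproduce the dense output exactly, since the gate weights sum to one.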