CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high training cost and poor convergence of Mixture-of-Experts (MoE) CLIP models, this paper adopts sparse upcycling, a paradigm that efficiently converts a pre-trained dense CLIP model into a lightweight sparse MoE architecture without training from scratch. The proposed recipe, CLIP-UP, integrates knowledge-distillation-based parameter initialization, auxiliary loss design, and a learnable sparse routing mechanism. The resulting model significantly reduces both training overhead and inference FLOPs: it achieves absolute improvements of +7.2% and +6.6% in text-to-image retrieval R@1 on COCO and Flickr30k, respectively, and outperforms the larger CLIP-L/14 baseline while consuming only 30% of its inference compute. To the authors' knowledge, this is the first sparse-upcycling framework tailored for CLIP, balancing high performance, low computational cost, and deployment efficiency.

📝 Abstract
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on the COCO and Flickr30k text-to-image Recall@1 benchmarks, respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
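The core idea of sparse upcycling described above can be sketched in a few lines: each expert in the MoE layer is initialized as a copy of the pre-trained dense FFN, and a learnable router sparsely dispatches tokens, with a load-balancing auxiliary loss keeping experts evenly used. The sketch below is a minimal, hedged illustration, not the paper's exact implementation; the class names (`DenseFFN`, `SparseMoE`), the top-1 routing choice, and the Switch-Transformer-style auxiliary loss are assumptions for illustration.

```python
# Minimal sketch of sparse upcycling for one Transformer FFN block.
# Assumes a CLIP-style MLP (fc1 -> GELU -> fc2); all names are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Stand-in for a pre-trained dense CLIP FFN block."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class SparseMoE(nn.Module):
    """Top-1 routed MoE whose experts start as copies of a dense FFN."""
    def __init__(self, dense_ffn, num_experts=4):
        super().__init__()
        d_model = dense_ffn.fc1.in_features
        # Upcycling step: every expert is initialized from the pre-trained
        # dense FFN weights, so no expert trains from scratch.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_ffn) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (tokens, E)
        top_p, top_idx = probs.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by router prob so the router receives gradients.
                out[mask] = top_p[mask, None] * expert(x[mask])
        # Load-balancing auxiliary loss (Switch-Transformer style):
        # fraction of tokens per expert times mean router prob per expert.
        frac = F.one_hot(top_idx, len(self.experts)).float().mean(0)
        aux_loss = len(self.experts) * (frac * probs.mean(0)).sum()
        return out, aux_loss


dense = DenseFFN()          # pretend this holds pre-trained weights
moe = SparseMoE(dense, num_experts=4)
x = torch.randn(8, 64)
y, aux = moe(x)
```

Because each token runs through only one expert, inference FLOPs grow only by the (small) router cost rather than by the number of experts, which is how an upcycled model can add capacity while keeping compute well below a larger dense baseline.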
Problem

Research questions and friction points this paper is trying to address.

Multi-modal Models
Cost-efficient Training
Performance Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-UP
MoE Sparse Upcycling
Resource-Efficient CLIP Training