Active Data Curation Effectively Distills Large-Scale Multimodal Models

πŸ“… 2024-11-27
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the degradation of zero-shot transfer performance and the high inference cost of compressed large-scale multimodal models, this paper proposes ACID, a simple online batch selection method that uses active data curation as an implicit form of distillation for contrastive image-text pretraining. Rather than relying on ever more complex knowledge distillation (KD) strategies, ACID dynamically selects the image-text pairs most useful to the small learner model at each training step. The authors further show that this active curation strategy is complementary to standard KD, and combine the two in the ACED pretraining framework. Evaluated on 27 zero-shot classification and retrieval benchmarks, ACED achieves state-of-the-art performance with up to 11% fewer inference FLOPs. Moreover, ACED-trained vision encoders outperform larger encoders on image captioning and visual question answering in the LiT-Decoder setting. These results demonstrate the effectiveness and strong generalization of the active, data-driven paradigm for multimodal model compression.
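The core idea of online batch selection can be sketched as follows. A common active-curation heuristic (used here for illustration; the paper's exact scoring criterion may differ) scores each candidate example by its "learnability": the loss of the small learner minus the loss of a strong pretrained reference model, then keeps the top-k scoring examples from a large candidate super-batch. The function name and scoring rule below are assumptions, not the paper's verbatim method.

```python
import numpy as np

def select_learnable_batch(learner_loss, reference_loss, k):
    """Pick the k candidates with the highest 'learnability' score.

    Score = learner loss - reference loss: examples the small learner
    still finds hard but a strong reference model finds easy are
    prioritized. (Illustrative heuristic, not the paper's exact rule.)
    """
    scores = np.asarray(learner_loss) - np.asarray(reference_loss)
    # Indices of the k highest scores, best first.
    return np.argsort(scores)[-k:][::-1]

# Toy super-batch of 6 candidate image-text pairs:
learner = [2.0, 0.5, 3.0, 1.0, 2.5, 0.2]      # small model's per-pair losses
reference = [0.4, 0.4, 2.9, 0.2, 0.5, 0.1]    # reference model's losses
idx = select_learnable_batch(learner, reference, k=3)
# The selected sub-batch would then be used for one contrastive update.
```

In a real pipeline this selection runs inside the training loop: a large candidate batch is scored, the selected sub-batch trains the learner, and the process repeats, so the effective data distribution adapts to the learner's current state.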

πŸ“ Abstract
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy is in fact complementary to standard KD, and the two can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% less inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
Problem

Research questions and friction points this paper is trying to address.

Improving knowledge distillation for multimodal models
Enhancing efficiency in contrastive pretraining via data curation
Reducing inference costs while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active data curation for contrastive multimodal pretraining
Online batch selection method ACID outperforms KD baselines
ACED framework achieves state-of-the-art results with fewer inference FLOPs
πŸ”Ž Similar Papers
No similar papers found.