🤖 AI Summary
Existing knowledge distillation (KD) methods for vision-language models (VLMs) are constrained by small teacher models, narrow evaluation settings, and poor scalability to large-scale CLIP-style VLMs.
Method: We conduct a systematic investigation of CLIP-style VLMs as teachers in KD, focusing on multimodal downstream tasks—particularly visual question answering (VQA)—across multiple teacher scales and architectural configurations.
Contribution/Results: We present the first empirical evidence that “stronger teachers do not necessarily yield better students,” challenging the conventional assumption that student performance scales monotonically with teacher strength. Our VQA-oriented distillation experiments expose structural limitations in current KD frameworks: scaling up teacher models often degrades downstream performance due to misaligned cross-modal representations and ineffective representation transfer. The study establishes a new benchmark for multimodal KD, delivers critical insights into teacher-student dynamics in large VLMs, and provides a reproducible, empirically grounded foundation for future research.
📝 Abstract
Vision-language models (VLMs) have achieved remarkable success across multimodal tasks, yet their substantial computational demands hinder efficient deployment. Knowledge distillation (KD) has emerged as a powerful approach for building lightweight but competitive models, with strong evidence from both language and vision domains. However, its application to VLMs, particularly CLIP-style models, remains limited, often constrained to small-scale teachers and narrow evaluation tasks such as classification or retrieval. In this work, we present the first systematic study of distillation across a spectrum of CLIP-style teacher models, from standard baselines to large-scale state-of-the-art models. Contrary to trends observed in NLP and vision, we find that stronger teachers do not consistently yield better students; in fact, existing distillation frameworks often fail to scale, leading to degraded performance in downstream multimodal tasks such as visual question answering. Our findings challenge prevailing assumptions in KD and point toward new directions for designing parameter-efficient multimodal models.
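To make the distillation setup concrete: KD frameworks for CLIP-style models typically match the teacher's and student's image-text similarity distributions with a temperature-scaled KL divergence. The sketch below is illustrative only, not the paper's method; the function names and inputs (per-image rows of similarity scores against a shared text batch) are hypothetical assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    # Temperature-scaled softmax over one row of similarity scores.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_sims, student_sims, temperature=4.0):
    """Mean KL(teacher || student) over image-text similarity rows.

    teacher_sims / student_sims: per-image lists of similarity scores
    against the same batch of text embeddings (hypothetical inputs).
    A higher temperature softens both distributions, a common KD choice.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_sims, student_sims):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher_sims)
```

Under this objective, a student that exactly reproduces the teacher's similarity rows incurs zero loss; the paper's finding is that minimizing such objectives against a much stronger teacher does not guarantee a better student downstream.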