🤖 AI Summary
Existing knowledge distillation (KD) methods for vision-language models (VLMs) are constrained by small teacher models, narrow evaluation settings, and poor scalability to large-scale CLIP-style VLMs.
Method: We conduct a systematic investigation of CLIP-style VLMs as teachers in KD, focusing on multimodal downstream tasks—particularly visual question answering (VQA)—across multiple teacher scales and architectural configurations.
Contribution/Results: We present the first empirical evidence that “stronger teachers do not necessarily yield better students,” challenging the conventional assumption that student performance scales monotonically with teacher strength. Our VQA-oriented distillation experiments expose structural limitations in current KD frameworks: scaling up teacher models often degrades downstream performance due to misaligned cross-modal representations and ineffective representation transfer. The study establishes a new benchmark for multimodal KD, delivers critical insights into teacher-student dynamics in large VLMs, and provides a reproducible, empirically grounded foundation for future research.
📝 Abstract
Vision-language models (VLMs) have achieved remarkable success across multimodal tasks, yet their substantial computational demands hinder efficient deployment. Knowledge distillation (KD) has emerged as a powerful approach for building lightweight but competitive models, with strong evidence from both language and vision domains. However, its application to VLMs, particularly CLIP-style models, remains limited, often constrained to small-scale teachers and narrow evaluation tasks such as classification or retrieval. In this work, we present the first systematic study of distillation across a spectrum of CLIP-style teacher models, from standard baselines to large-scale state-of-the-art models. Contrary to trends observed in NLP and vision, we find that stronger teachers do not consistently yield better students; in fact, existing distillation frameworks often fail to scale, leading to degraded performance in downstream multimodal tasks such as visual question answering. Our findings challenge prevailing assumptions in KD and point toward new directions for designing parameter-efficient multimodal models.
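To make the distillation setup concrete: KD frameworks for CLIP-style models typically match the teacher's and student's image-text similarity distributions with a temperature-scaled KL divergence. The sketch below is illustrative only, not the paper's method; the function names and inputs (per-image rows of similarity scores against a shared text batch) are hypothetical assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    # Temperature-scaled softmax over one row of similarity scores.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_sims, student_sims, temperature=4.0):
    """Mean KL(teacher || student) over image-text similarity rows.

    teacher_sims / student_sims: per-image lists of similarity scores
    against the same batch of text embeddings (hypothetical inputs).
    A higher temperature softens both distributions, a common KD choice.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_sims, student_sims):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher_sims)
```

Under this objective, a student that exactly reproduces the teacher's similarity rows incurs zero loss; the paper's finding is that minimizing such objectives against a much stronger teacher does not guarantee a better student downstream.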