From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes ARMADA, a novel framework for cross-modal knowledge distillation that operates without requiring access to the teacher model’s multimodal pretraining or internal architecture, thereby enabling black-box distillation from vision-language models to pure language models. By introducing a new alignment mechanism, ARMADA effectively overcomes the heterogeneity between modalities. The approach is compatible with diverse language model architectures—including DeBERTa, OPT, and LLaMA—and demonstrates consistent performance gains across a broad range of tasks: it achieves improvements of up to 3.4% on 12 language understanding benchmarks, 2.6% on 8 generative reasoning tasks, and notable enhancements on 5 instruction fine-tuning evaluations.

📝 Abstract
Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
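The abstract does not spell out ARMADA's alignment mechanism, but the black-box setting it describes — the student never sees the teacher's weights or logits, only its outputs — matches the general shape of sequence-level knowledge distillation. A minimal sketch of that generic pattern follows; the `blackbox_teacher` stub, the toy unigram student, and all names here are illustrative assumptions, not the paper's actual method.

```python
import math
from collections import Counter

def blackbox_teacher(prompt):
    """Stand-in for a black-box vision-language teacher: we can only
    observe generated text, never internal logits or parameters.
    (Hypothetical stub for illustration.)"""
    return {
        "describe": "a red apple on a table",
        "count": "two birds on a wire",
    }.get(prompt, "unknown scene")

def distill_dataset(prompts):
    """Sequence-level KD data collection: query the teacher and keep
    (prompt, teacher output) pairs. No teacher internals required."""
    return [(p, blackbox_teacher(p)) for p in prompts]

def train_student(prompts):
    """Toy 'student': a unigram token model fit by counting tokens in
    the teacher's outputs (in practice this would be cross-entropy
    fine-tuning of a language model on the distilled pairs)."""
    counts = Counter()
    for _, target in distill_dataset(prompts):
        counts.update(target.split())
    return counts

def token_nll(student_unigram, text, vocab_size=1000):
    """Add-one-smoothed negative log-likelihood of `text` under the
    toy student; lower NLL = student better matches the teacher."""
    total = sum(student_unigram.values()) + vocab_size
    return -sum(
        math.log((student_unigram.get(t, 0) + 1) / total)
        for t in text.split()
    )
```

Distilling from outputs alone is what makes the setting scalable: the teacher is queried, never modified or re-pretrained, which is the efficiency property the abstract emphasizes.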
Problem

Research questions and friction points this paper is trying to address.

cross-modal knowledge distillation
black-box teacher
vision-language models
language models
knowledge transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal knowledge distillation
black-box teacher
vision-language models
language model compression
modality alignment