Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
During vision-language model (VLM) compression, multilingual performance degradation intensifies and becomes markedly imbalanced across languages. This work systematically investigates knowledge distillation (KD) as an adaptation mechanism for multilingual VLM compression. Leveraging CLIP and SigLIP architectures, we design and comparatively evaluate five KD strategies through controlled experiments on in-domain cross-lingual image–text retrieval and out-of-domain multilingual visual question answering (VQA). We first reveal a sensitive trade-off across KD configurations between cross-lingual representation consistency and cross-task stability. Notably, certain strategies—e.g., intermediate-layer feature distillation with language-aware weighting—maintain or even improve multilingual retrieval performance (average +1.2% mAP) under 50% parameter reduction, yet induce substantial instability in multilingual VQA (±4.8% fluctuation). Our study establishes a reproducible methodology and empirical benchmark for efficient, robust multilingual VLM compression.
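The summary names intermediate-layer feature distillation with language-aware weighting as one of the evaluated KD strategies. Below is a minimal sketch of what such a loss could look like, assuming per-language scalar weights and mean-squared feature matching; the function name, the weighting scheme, and the use of MSE are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats, lang_weights, lang_ids):
    """Language-aware intermediate-layer feature distillation (illustrative sketch).

    student_feats, teacher_feats: (batch, dim) intermediate-layer features.
    lang_weights: mapping from language id to a scalar weight, e.g. upweighting
        low-resource languages whose performance degrades most under compression.
    lang_ids: one language id per batch element.
    """
    # Per-example MSE between student and teacher intermediate features.
    per_example = np.mean((student_feats - teacher_feats) ** 2, axis=1)
    # Language-aware weighting: scale each example by its language's weight,
    # then normalize by the total weight so the loss stays scale-comparable.
    w = np.array([lang_weights[lid] for lid in lang_ids])
    return float(np.sum(w * per_example) / np.sum(w))

# Toy usage: two languages, with the hypothetical low-resource one ("sw") upweighted.
teacher = np.ones((4, 8))
student = np.zeros((4, 8))
loss = feature_distillation_loss(student, teacher,
                                 {"en": 1.0, "sw": 2.0},
                                 ["en", "en", "sw", "sw"])
print(loss)  # → 1.0 (each per-example MSE is 1.0, so the weighted mean is 1.0)
```

The normalization by the total weight is one design choice among several; an alternative is to leave the weighted sum unnormalized so that reweighting also changes the effective learning rate for low-resource languages.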

📝 Abstract
Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when model size is reduced. While knowledge distillation (KD) has shown promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings remains underexplored. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, while others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
Problem

Research questions and friction points this paper is trying to address.

Addressing multilingual performance imbalance in compressed vision-language models
Evaluating knowledge distillation approaches for cross-lingual representation consistency
Analyzing trade-offs between model compression and multilingual task stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated five distillation approaches for multilingual VLMs
Studied cross-lingual representation consistency under compression
Identified configurations preserving multilingual robustness with smaller models