xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of multilingual multimodal embedding models, this paper proposes a self-knowledge distillation adaptation method for English-pretrained large vision-language models (LVLMs). The method integrates multilingual contrastive learning, a cross-modal alignment loss, and a prompt-guided embedding disentanglement mechanism to enable joint cross-lingual image–text representation learning. The authors also introduce the first dedicated multilingual multimodal evaluation benchmark, supporting cross-lingual image–text retrieval and alignment assessment. Extensive experiments across 12 languages and four multimodal tasks demonstrate substantial improvements over monolingual baselines: xVLM2Vec achieves an average 18.7% gain in retrieval accuracy over X-VLM and exhibits strong zero-shot cross-lingual transfer performance.
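The summary above names two of the training signals concretely: a distillation objective that keeps the student's multilingual embeddings close to the English-trained model's own embeddings, and a contrastive objective over paired image–text embeddings. The paper's exact loss formulations are not given on this page, so the following is only a minimal numpy sketch assuming standard choices (cosine-distance distillation, symmetric InfoNCE); the function names and the temperature value are illustrative, not from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(teacher_emb, student_emb):
    """Mean (1 - cosine similarity) between the frozen English 'teacher'
    embeddings and the student's embeddings of the translated inputs --
    one common way to realize self-knowledge distillation (assumed here)."""
    t = l2_normalize(teacher_emb)
    s = l2_normalize(student_emb)
    return float(np.mean(1.0 - np.sum(t * s, axis=-1)))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: pair i is the positive, all other
    batch items are negatives, in both retrieval directions."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return float((xent(logits) + xent(logits.T)) / 2)
```

In a sketch like this, the total adaptation objective would be a weighted sum of the two terms (plus the alignment and disentanglement terms the summary mentions, whose forms the page does not specify).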

📝 Abstract
In the current literature, most embedding models are based on the encoder-only transformer architecture to extract a dense and meaningful representation of a given input, such as text or images. With the recent advances in language modeling brought by Large Language Models, the possibility of extracting embeddings from these large, extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained, and very few models consider multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.
Problem

Research questions and friction points this paper is trying to address.

Adapting English-trained Large Vision-Language Models for multilingual embeddings.
Improving performance in extracting multimodal and multilingual embeddings.
Introducing a benchmark for evaluating multilingual and multimodal embedding models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts LVLM for multilingual embedding extraction
Uses Self-Knowledge Distillation for model adaptation
Introduces benchmark for multilingual multimodal evaluation
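The benchmark contribution above centers on cross-lingual image–text retrieval. The page does not describe the evaluation protocol, so the sketch below assumes the standard Recall@K setup for image–text retrieval, where query i's correct match is gallery item i; the function name and protocol are assumptions, not taken from the paper.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose ground-truth item (index-matched in the
    gallery) appears among the top-K most similar gallery embeddings.
    Inputs are assumed L2-normalized so dot products are cosine sims."""
    sims = query_emb @ gallery_emb.T       # (Q, G) similarity matrix
    ranks = np.argsort(-sims, axis=1)      # best match first per query
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())
```

Retrieval accuracy figures like the 18.7% average gain reported in the summary would typically be aggregated from such Recall@K scores across languages and tasks.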