🤖 AI Summary
To address the scarcity of multilingual multimodal embedding models, this paper proposes a self-knowledge distillation method for adapting English-pretrained Large Vision-Language Models (LVLMs). The method combines multilingual contrastive learning, a cross-modal alignment loss, and a prompt-guided embedding disentanglement mechanism to enable joint cross-lingual image–text representation learning. The authors also introduce the first dedicated multilingual multimodal evaluation benchmark, supporting cross-lingual image–text retrieval and alignment assessment. Extensive experiments across 12 languages and four multimodal tasks show substantial improvements over monolingual baselines: xVLM2Vec achieves an average 18.7% gain in retrieval accuracy over X-VLM and exhibits strong zero-shot cross-lingual transfer.
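As a rough illustration of the distillation-with-contrastive-learning idea described above, the sketch below pairs each translated input with the English teacher embedding of the same example and applies an InfoNCE-style loss over the in-batch similarity matrix. This is a minimal, hypothetical reconstruction, not the paper's actual objective: the function name, temperature value, and the choice of in-batch negatives are all assumptions.

```python
import numpy as np

def self_distillation_loss(teacher_emb, student_emb, temperature=0.05):
    """Illustrative sketch: each translated (student) embedding should be
    closest to the English (teacher) embedding of the same example.
    Both inputs have shape (batch, dim); positives lie on the diagonal."""
    # L2-normalize rows so dot products become cosine similarities.
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (batch, batch) similarity matrix
    # Softmax cross-entropy with diagonal entries as the positive pairs.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: a student whose embeddings are close to the teacher's
# yields a small loss, since the diagonal dominates the logits.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 256))
student = teacher + 0.1 * rng.standard_normal((8, 256))
loss = self_distillation_loss(teacher, student)
print(loss)
```

In practice such objectives are usually combined with additional terms (here, the paper's cross-modal alignment loss) and trained over image and text towers jointly; the sketch only covers the text-to-text distillation direction.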
📝 Abstract
In the current literature, most embedding models are based on the encoder-only transformer architecture, extracting a dense and meaningful representation of a given input, such as text or images. With the recent advances in language modeling brought by Large Language Models, the possibility of extracting embeddings from these large, extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language these models have been trained on. Furthermore, very few models handle multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.