Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how the domain and scale of parallel data affect the cross-lingual transfer performance of vision-language (VL) encoders, a previously underexplored question. We propose a fine-tuning framework built on pretrained multilingual VL encoders in which only the text encoder is updated, combined with multilingual joint training and domain-controlled ablation experiments. The key findings are: (1) machine-translated task data yields the best overall performance, though authentic caption-like parallel data is more effective for some languages; and (2) most languages benefit substantially from scaling up multilingual parallel data. On benchmarks including XNLI and Flickr30k-CN, the approach achieves significant improvements in zero-shot cross-lingual transfer accuracy. These results empirically validate that the domain of the parallel data and its linguistic diversity are critical factors for improving VL model generalization across languages.
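The transfer recipe the summary describes (keep the vision side frozen, update only the text side so target-language sentences land near the embeddings of their parallel English counterparts) can be sketched in miniature. Everything below is a toy stand-in, not the paper's actual models: the "teacher" embeddings play the role of the frozen English VL text encoder, and a single trainable linear map plays the role of the fine-tuned text encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): d-dimensional embeddings,
# n_pairs parallel (English, target-language) sentence pairs.
d = 16
n_pairs = 64

# Frozen "teacher" embeddings of the English side of each pair,
# standing in for the pretrained VL text encoder's outputs.
teacher_emb = rng.normal(size=(n_pairs, d))

# Raw target-language features; the "student" learns W to map them
# onto the teacher's embedding space. Only W is updated, mirroring
# the setup in which only the text encoder is fine-tuned.
target_feats = rng.normal(size=(n_pairs, d))
W = rng.normal(scale=0.1, size=(d, d))

def mse(W):
    """Mean squared distance between mapped target-language
    embeddings and the frozen English teacher embeddings."""
    diff = target_feats @ W - teacher_emb
    return float((diff ** 2).mean())

# Plain gradient descent on the alignment objective.
lr = 0.5
losses = [mse(W)]
for _ in range(200):
    diff = target_feats @ W - teacher_emb            # (n_pairs, d)
    grad = 2.0 * target_feats.T @ diff / (n_pairs * d)
    W -= lr * grad
    losses.append(mse(W))
```

In the paper's actual setting the student is a full multilingual text encoder trained with a contrastive or distillation loss over real parallel corpora; the point of the sketch is only the training asymmetry, i.e. that the English/vision side supplies fixed targets while the target-language side is optimized toward them.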

📝 Abstract
Most pre-trained Vision-Language (VL) models and the training data for downstream tasks are available only in English. Multilingual VL tasks are therefore solved using cross-lingual transfer: fine-tune a multilingual pre-trained model, or transfer the text encoder using parallel data. We study the alternative approach: transferring an already trained encoder using parallel data. We investigate the effect of the parallel data itself, namely its domain and the number of languages, which were out of focus in previous work. Our results show that although machine-translated task data are the best on average, caption-like authentic parallel data outperformed them in some languages. Further, we show that most languages benefit from multilingual training.
Problem

Research questions and friction points this paper is trying to address.

Effect of parallel data on cross-lingual transfer for VL encoders
Impact of domain and language count in parallel data
Performance comparison: machine-translated vs authentic parallel data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transferring an already trained text encoder using parallel data
Systematic study of parallel-data domain and number of languages
Finding that machine-translated task data performs best on average
Andrei-Alexandru Manea
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, V Holešovičkách 2, 180 00 Prague, Czech Republic
Jindřich Libovický
Charles University
natural language processing · multilinguality · neural machine translation · language and vision