📝 Abstract
Most pre-trained Vision-Language (VL) models and the training data for downstream tasks are available only in English. Multilingual VL tasks are therefore typically solved via cross-lingual transfer, either by fine-tuning a multilingual pre-trained model or by transferring the text encoder using parallel data. We study the latter approach: transferring an already trained text encoder using parallel data. We investigate two properties of the parallel data that previous work has largely overlooked: its domain and the number of languages it covers. Our results show that while machine-translated task data perform best on average, authentic caption-like parallel data outperform them for some languages. Further, we show that most languages benefit from multilingual training.
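The text-encoder transfer described above can be sketched as distillation on parallel sentence pairs: a frozen teacher (the original English text encoder of the VL model) produces target embeddings for the English side, and a trainable student is pulled towards those targets on the translated side. The sketch below is a toy illustration, not the paper's implementation: a linear map over bag-of-bytes features stands in for a real pretrained encoder, and the MSE objective, learning rate, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, VOCAB = 16, 256  # toy embedding and "vocabulary" sizes (assumptions)

def featurize(sentence: str) -> np.ndarray:
    """Unit-normalized bag-of-bytes vector (toy stand-in for tokenization)."""
    v = np.zeros(VOCAB)
    for b in sentence.encode("utf-8"):
        v[b] += 1.0
    return v / max(1.0, np.linalg.norm(v))

# Frozen teacher projection (stands in for the English text encoder of a VL model).
W_teacher = rng.normal(size=(DIM, VOCAB))
# Trainable student, initialized near the teacher and fine-tuned on parallel data.
W_student = W_teacher + 0.5 * rng.normal(size=(DIM, VOCAB))

def distill_step(pairs, W, lr=0.5):
    """One MSE-distillation step: pull the student's embeddings of the
    translations towards the teacher's embeddings of the English sources."""
    grad = np.zeros_like(W)
    loss = 0.0
    for en, xx in pairs:
        target = W_teacher @ featurize(en)  # frozen teacher embedding
        pred = W @ featurize(xx)            # student embedding of the translation
        diff = pred - target
        loss += float(diff @ diff)
        grad += np.outer(diff, featurize(xx))
    W -= lr * grad / len(pairs)             # in-place gradient step
    return loss / len(pairs)

# Hypothetical English-German parallel pairs (illustrative only).
pairs = [("a dog runs", "ein Hund rennt"), ("a red car", "ein rotes Auto")]
losses = [distill_step(pairs, W_student) for _ in range(200)]
print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.2e}")
```

Because the image encoder and the teacher stay frozen, the student can be swapped in after training and the model's image-text alignment carries over to the new languages; the domain and language coverage of `pairs` are exactly the variables the paper studies.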