🤖 AI Summary
This work addresses the strong English bias in current vision-language models (VLMs), which stems from a scarcity of multilingual training data and the absence of unified evaluation benchmarks. To bridge this gap, the authors propose a regeneration–translation paradigm to systematically construct a high-quality multilingual vision-language resource suite covering five European languages. This includes Multi-PixMo—a synthetically generated and human-refined training corpus—and a corresponding multilingual evaluation benchmark. Through cross-lingual alignment analyses and multi-model ablation studies on three prominent VLMs, they demonstrate that multilingual training not only substantially improves performance on non-English tasks but also yields positive transfer effects on English tasks. Human evaluations further confirm the high consistency and quality of the constructed resources.
📝 Abstract
Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMBench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, compared with English-only data, on VLM training. Experiments with three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.