Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

📅 2024-09-03

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This study addresses cultural bias and evaluation distortion in large vision-language models (LVLMs) for non-English art explanation generation, stemming from English-only pretraining and machine-translated evaluation benchmarks. To this end, we introduce MultiExpArt—the first machine-translation-free, multilingual, human-annotated art explanation dataset, covering Chinese, Japanese, French, Spanish, Arabic, and other languages. Methodologically, we conduct cross-lingual consistency analysis and instruction-tuning experiments to systematically evaluate LVLMs’ multilingual art description capabilities and the transferability of English-instruction fine-tuning. Results reveal that LVLMs exhibit significantly lower non-English generation quality compared to English, and English instruction tuning yields only marginal cross-lingual improvements—indicating a fundamental bottleneck in transferring art-understanding knowledge across languages. Our contributions are threefold: (1) a high-quality, multilingual art explanation benchmark; (2) the first empirical demonstration of LVLMs’ cross-lingual semantic generalization limitations; and (3) public release of the dataset on Hugging Face.

📝 Abstract

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and demand for LVLM-generated explanations is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks built with machine translation carry cultural differences and biases, which remain a problem when they are used as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then used to evaluate the explanation generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English than they do in English. In addition, LVLMs struggle to make effective use of the knowledge learned from English data. Our dataset is available at https://huggingface.co/datasets/naist-nlp/MultiExpArt

Problem

Research questions and friction points this paper is trying to address.

Cross-lingual explanation in LVLMs

Impact of English training on multilingual performance

Cultural biases in multilingual QA benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended multilingual dataset creation

Instruction-Tuning for language performance

Evaluation of LVLMs' explanation generation

🔎 Similar Papers

Have Large Vision-Language Models Mastered Art History?