AI Summary
This work addresses the severe scarcity of non-English image captioning data in multilingual vision-and-language (V&L) research: current datasets cover only 23 languages, far fewer than the roughly 500 institutional languages worldwide. We systematically survey all non-English image captioning datasets available as of May 2024, constructing a manually curated, reproducible dataset inventory and metadata database, and conduct a cross-dataset statistical analysis and field-wide diagnostic assessment. Adding the Crossmodal-3600 dataset, which itself spans 36 languages, raises the total language count somewhat, but coverage remains narrow. Our contributions are threefold: (1) releasing the most comprehensive index of non-English image captioning datasets to date; (2) quantitatively characterizing the language coverage gap, underscoring the urgency of advancing V&L research for low-resource languages; and (3) posing open problems for multilingual image captioning, offering guidance for benchmark development and methodological innovation.
Abstract
This short position paper provides a manually curated list of non-English image captioning datasets (as of May 2024). The list reveals a dearth of datasets across languages: only 23 languages are represented. With the addition of the Crossmodal-3600 dataset (Thapliyal et al., 2022; 36 languages), this number increases somewhat, but it remains small compared to the roughly 500 institutional languages in the world. The paper closes with some open questions for the field of Vision & Language.