AI Summary
This work addresses the severe scarcity of non-English image captioning data in multilingual vision-and-language (V&L) research: current datasets cover only 23 languages, far fewer than the roughly 500 institutional languages worldwide. We systematically survey all non-English image captioning datasets available as of May 2024, constructing a manually curated, reproducible dataset inventory and metadata database, and conduct a cross-dataset statistical analysis and field-wide diagnostic assessment. Adding the Crossmodal-3600 dataset, which itself spans 36 languages, raises the total language count somewhat, but coverage remains narrow. Our contributions are threefold: (1) releasing the most comprehensive index of non-English image captioning datasets to date; (2) quantitatively characterizing the language coverage gap, underscoring the urgency of advancing V&L research for low-resource languages; and (3) posing open problems for multilingual image captioning, offering guidance for benchmark development and methodological innovation.
Abstract
This short position paper provides a manually curated list of non-English image captioning datasets (as of May 2024). The list reveals a dearth of datasets across languages: only 23 languages are represented. With the addition of the Crossmodal-3600 dataset (Thapliyal et al., 2022; 36 languages), this number increases somewhat, but it remains small compared to the roughly 500 institutional languages in the world. The paper closes with some open questions for the field of Vision & Language.