🤖 AI Summary
This work addresses the limited support for low-resource languages in existing news image captioning research, which has predominantly focused on English. To bridge this gap, the authors introduce MUNIChus, the first multilingual benchmark for news image captioning, spanning nine languages including low-resource ones such as Sinhala and Urdu. The dataset pairs news articles with their associated images so that captions can draw on both textual and visual context, and it enables a systematic evaluation of more than twenty state-of-the-art multimodal models. MUNIChus fills a critical data gap in cross-lingual multimodal understanding, highlights the persistent difficulty of multilingual news image captioning, and establishes a foundational resource and benchmark for future research in this domain.
📝 Abstract
The goal of news image captioning is to generate captions by integrating news article content with corresponding images, capturing the relationship between textual context and visual elements. Most research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising nine languages, including several low-resource languages such as Sinhala and Urdu. We evaluate a range of state-of-the-art neural news image captioning models on MUNIChus and find that the task remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for developing and evaluating multilingual news image captioning models.