🤖 AI Summary
This work addresses the lack of a unified evaluation benchmark for Multilingual Multimodal Entity Linking. We introduce the first dedicated benchmark for this task, comprising BBC news headlines paired with corresponding images in five languages, annotated with over 7,000 entity mentions linked to more than 2,500 Wikidata entities. To tackle the task, we propose a joint modeling approach that combines multilingual large language models (e.g., LLaMA-2, Aya-23) with multimodal encoders so that text and images jointly disambiguate entity mentions. Experimental results demonstrate that visual cues substantially improve cross-lingual linking accuracy, in particular by mitigating textual ambiguity in low-resource languages. Our benchmark is the first to systematically quantify the gain from multimodal signals in multilingual entity linking, establishing a reproducible and extensible evaluation standard for future research.
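The summary describes the joint modeling approach only at a high level. As a minimal sketch of what text–image collaborative disambiguation over candidate Wikidata entities could look like, the snippet below late-fuses cosine similarities from a text encoder and an image encoder; the function name, the fusion weight `alpha`, and the assumption that candidates come with textual descriptions and reference images are illustrative choices, not the paper's confirmed architecture.

```python
# Illustrative sketch (not the authors' exact method): score a mention
# against each candidate Wikidata entity by fusing textual similarity
# (mention context vs. entity description) with visual similarity
# (headline image vs. a candidate's reference image).
import torch
import torch.nn.functional as F

def fused_entity_scores(
    mention_emb: torch.Tensor,      # (d,) embedding of the mention in context
    image_emb: torch.Tensor,        # (d,) embedding of the headline image
    cand_text_embs: torch.Tensor,   # (k, d) candidate entity description embeddings
    cand_image_embs: torch.Tensor,  # (k, d) candidate entity image embeddings
    alpha: float = 0.5,             # hypothetical weight on the visual signal
) -> torch.Tensor:
    """Return a (k,) score per candidate; the argmax is the predicted entity."""
    text_sim = F.cosine_similarity(mention_emb.unsqueeze(0), cand_text_embs, dim=-1)
    image_sim = F.cosine_similarity(image_emb.unsqueeze(0), cand_image_embs, dim=-1)
    return (1 - alpha) * text_sim + alpha * image_sim

# Toy usage: random vectors stand in for LLM / vision-encoder outputs.
d, k = 256, 5
scores = fused_entity_scores(
    torch.randn(d), torch.randn(d), torch.randn(k, d), torch.randn(k, d)
)
predicted_candidate = scores.argmax().item()
```

When the textual context is ambiguous (e.g., a short headline naming a common surname), the visual term can break ties between otherwise similar candidates, which is consistent with the gains the summary reports for low-resource languages.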
📝 Abstract
This paper introduces MERLIN, a novel testbed for the task of Multilingual Multimodal Entity Linking. The dataset comprises BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also provide several benchmarks using multilingual and multimodal entity linking methods built on language models such as LLaMA-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities whose textual context is ambiguous or insufficient, and particularly for models that lack strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin