🤖 AI Summary
Existing cross-modal image retrieval benchmarks do not rigorously evaluate deep visual–linguistic co-understanding; in particular, they do not support joint queries that combine multi-entity images with relational text. To address this, we introduce MMIR, a benchmark for mixed-modal image retrieval comprising the Entity Image (EI) dataset and the Mixed-Modal Image Retrieval (MMIR) dataset. We propose a “multi-entity image + relational text” query paradigm that formally defines and evaluates difficult retrieval tasks requiring cross-modal contextual alignment and semantic grounding. Built upon the WIT corpus, MMIR undergoes Wikidata entity alignment, crowd-sourced human validation, and cross-modal cleaning to support reliability and reproducibility; it is openly released and can be used for both model training and evaluation. Empirical results indicate that MMIR is useful both as a training corpus and as an evaluation set for probing how well vision–language models capture entity associations and perform contextual reasoning.
📝 Abstract
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval, which combines visual and textual information, are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types that require models to ground textual descriptions in the context of provided visual entities: single-entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are available on GitHub: https://github.com/google-research-datasets/wit-retrieval.
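To make the two query types concrete, the following is a minimal Python sketch of how a mixed-modal query could be represented and scored. It is not the released schema or the paper's evaluation code: the class `MixedModalQuery`, its field names, and the choice of recall@k as the metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MixedModalQuery:
    """Hypothetical query record; field names are illustrative, not the dataset's schema."""
    entity_images: Tuple[str, ...]  # identifiers of the entity images supplied with the query
    text: str                       # descriptive text (single) or relational text (multi)
    target_image: str               # identifier of the image that should be retrieved

    def __post_init__(self):
        # A mixed-modal query always carries at least one entity image.
        if len(self.entity_images) == 0:
            raise ValueError("a mixed-modal query needs at least one entity image")

    @property
    def query_type(self) -> str:
        # One entity image -> single-entity query; several -> multi-entity query.
        return "single-entity" if len(self.entity_images) == 1 else "multi-entity"


def recall_at_k(queries: Sequence[MixedModalQuery],
                rankings: Sequence[Sequence[str]],
                k: int) -> float:
    """Fraction of queries whose target image appears among the top-k ranked candidates.

    Recall@k is assumed here as a common retrieval metric; the abstract does not
    state which metric the benchmark reports.
    """
    if not queries:
        return 0.0
    hits = sum(q.target_image in ranked[:k] for q, ranked in zip(queries, rankings))
    return hits / len(queries)
```

In this sketch, a retrieval model would produce one ranked list of candidate image identifiers per query, and `recall_at_k` could be computed separately for the single-entity and multi-entity subsets by filtering on `query_type`.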