🤖 AI Summary
This work addresses the lack of cross-domain benchmarks and systematic studies for document-level multimodal information retrieval (MMIR). We introduce DocMMIR, the first large-scale (450K samples), cross-domain benchmark that fuses text and images across Wikipedia articles, academic papers, and presentation slides, and we formally define and tackle the document-level, cross-domain MMIR task for the first time. To address it, we propose a CLIP-based training strategy that combines a cross-modal fusion design with a contrastive learning loss, yielding a +31% improvement in MRR@10 over the zero-shot baseline. Extensive experiments reveal significant limitations of mainstream vision-language models (e.g., BLIP-2, SigLIP-2) on this task. We fully open-source the dataset and code, establishing a standardized evaluation platform and reproducible baselines for document-level MMIR research.
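The summary describes the training recipe only at a high level. As a rough illustration, a CLIP-style document retriever can pool a document's image embeddings, mix them with its text embedding, and optimize a symmetric contrastive (InfoNCE) loss over matched query–document pairs. The sketch below follows that reading; the mean-pooling fusion, the mixing weight `alpha`, and the function names are illustrative assumptions, not the exact DocMMIR design.

```python
# Minimal sketch of a CLIP-style contrastive objective for document-level
# retrieval. The fusion scheme (mean-pooling a document's image embeddings and
# mixing with its text embedding via `alpha`) is an assumption for illustration.
import torch
import torch.nn.functional as F


def fuse_document(text_emb, image_embs, alpha=0.5):
    """Fuse one document's text embedding (D,) with its image embeddings (N, D)."""
    img_emb = image_embs.mean(dim=0)                  # pool page/figure images
    fused = alpha * text_emb + (1.0 - alpha) * img_emb
    return F.normalize(fused, dim=-1)


def contrastive_loss(query_embs, doc_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (query, document) pairs."""
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    logits = q @ d.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(len(q), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```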
📝 Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval. However, existing multimodal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multimodal document retrieval framework explicitly designed to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a single comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark of 450K samples that systematically integrates textual and visual information. Our experimental analysis reveals substantial limitations in current state-of-the-art vision-language models (CLIP, BLIP-2, SigLIP-2, ALIGN) when applied to this task, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach for training CLIP on our benchmark, resulting in a +31% improvement in MRR@10 over the zero-shot baseline. All data and code are released at https://github.com/J1mL1/DocMMIR.
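The headline metric is MRR@10, the mean reciprocal rank truncated at the top 10 retrieved documents. A minimal sketch of the standard definition, assuming each query has a single gold document (the benchmark's own relevance counting may differ):

```python
# Illustrative MRR@10: the reciprocal rank of the first relevant document among
# the top-10 results, averaged over queries; 0 if it is not in the top 10.
def mrr_at_10(ranked_doc_ids, relevant_doc_id):
    """ranked_doc_ids: retrieved doc ids for one query, best first."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id == relevant_doc_id:
            return 1.0 / rank
    return 0.0


def mean_mrr_at_10(all_rankings, all_relevant):
    scores = [mrr_at_10(r, rel) for r, rel in zip(all_rankings, all_relevant)]
    return sum(scores) / len(scores)
```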