DocMMIR: A Framework for Document Multi-modal Information Retrieval

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of cross-domain benchmarks and systematic studies for document-level multimodal information retrieval (MMIR). We introduce DocMMIR, the first large-scale (450K samples), cross-domain benchmark fusing text and images, covering Wikipedia articles, academic papers, and presentation slides. We formally define and tackle the document-level, cross-domain MMIR task for the first time. To address it, we propose a customized CLIP-based training strategy that integrates a cross-modal alignment design with a contrastive learning loss, achieving a 31% improvement in MRR@10 over the zero-shot baseline. Extensive experiments reveal significant limitations of mainstream multimodal large language models (e.g., BLIP-2, SigLIP-2) on this task. We fully open-source the dataset and code, establishing a standardized evaluation platform and reproducible baselines for document-level MMIR research.
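The contrastive training objective mentioned above, a CLIP-style symmetric loss over matched query–document pairs in a batch, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names and the temperature value are assumptions, and the paper's cross-modal fusion of text and image features is not reproduced here.

```python
import numpy as np

def _ce_diag(logits):
    """Cross-entropy where the correct class for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def infonce_loss(query_emb, doc_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs sit on the diagonal of the
    batch similarity matrix; both retrieval directions are penalized."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                 # (B, B) cosine similarities
    return 0.5 * (_ce_diag(logits) + _ce_diag(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; shuffling the documents relative to their queries drives it up, which is the signal that pulls matched query–document pairs together during training.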

📝 Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP-2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released at https://github.com/J1mL1/DocMMIR.
Problem

Research questions and friction points this paper is trying to address.

Lack of document-level multi-modal retrieval exploration
Absence of cross-domain datasets for document retrieval
Limitations of current models in document retrieval tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies diverse document formats and domains
Constructs large-scale cross-domain multimodal benchmark
Tailored CLIP training improves retrieval performance
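The headline gain above is reported in MRR@10, which scores each query by the reciprocal rank of its first relevant document within the top 10 results, averaged over all queries. A minimal sketch of the metric (hypothetical helper names; not taken from the paper's released code):

```python
def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant document if it appears in the
    top-k results, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_mrr_at_k(rankings, k=10):
    """Average MRR@k over (ranked_ids, relevant_id) pairs, one per query."""
    return sum(mrr_at_k(ids, rel, k) for ids, rel in rankings) / len(rankings)
```

For example, a query whose relevant document is ranked second contributes 0.5, and one whose relevant document falls outside the top 10 contributes 0.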