MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual document retrieval benchmarks suffer from three key limitations: monolinguality (English-only), reliance on synthetic queries, and small, low-fidelity corpora. This work introduces MIRACL-VISION, a large-scale, multilingual visual document retrieval benchmark covering 18 languages and emphasizing complex-layout documents containing charts, tables, and other visual elements. To make evaluation more scalable while preserving task difficulty, the authors propose an "easy negative removal" strategy that reduces the corpus size; queries are human-annotated (inherited from MIRACL) to ensure high quality. Experiments reveal a substantial gap in current vision-language models' (VLMs) multilingual embedding capabilities: up to 59.7% lower retrieval accuracy than text-based models across languages, and 12.1% lower even on English alone. The benchmark is publicly released to enable reproducible, computationally efficient cross-modal retrieval evaluation.

📝 Abstract
Document retrieval is an important task for search and Retrieval-Augmented Generation (RAG) applications. Large Language Models (LLMs) have contributed to improving the accuracy of text-based document retrieval. However, documents with complex layouts and visual elements like tables, charts and infographics are not perfectly represented in textual format. Recently, image-based document retrieval pipelines have become popular, which use visual large language models (VLMs) to retrieve relevant page images given a query. Current evaluation benchmarks on visual document retrieval are limited, as they focus primarily on the English language, rely on synthetically generated questions and offer a small corpus size. Therefore, we introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset, a popular benchmark for evaluating text-based multilingual retrieval pipelines. MIRACL was built using a human-intensive annotation process to generate high-quality questions. In order to reduce the MIRACL-VISION corpus size to make evaluation more compute-friendly while keeping the datasets challenging, we have designed a method for eliminating the "easy" negatives from the corpus. We conducted extensive experiments comparing MIRACL-VISION with other benchmarks, using popular public text and image models. We observe a gap in state-of-the-art VLM-based embedding models' multilingual capabilities, with up to 59.7% lower retrieval accuracy than text-based retrieval models. Even for the English language, the visual models' retrieval accuracy is 12.1% lower compared to text-based models. MIRACL-VISION is a challenging, representative, multilingual evaluation benchmark for visual retrieval pipelines and will help the community build robust models for document retrieval.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in multilingual visual document retrieval benchmarks
Improving retrieval accuracy for visually complex documents using VLMs
Reducing corpus size while maintaining challenge for evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual visual document retrieval benchmark
Human-annotated high-quality questions dataset
Method to eliminate easy negatives from the corpus
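The easy-negative elimination idea above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes a baseline relevance scorer `score(query, doc)` and keeps, for each query, the relevant documents plus only the top-k highest-scoring non-relevant documents (the hard negatives), discarding the rest of the corpus. All names and the `keep_top_k` parameter are hypothetical.

```python
def prune_corpus(queries, corpus, relevant, score, keep_top_k=100):
    """Shrink a retrieval corpus by dropping 'easy' negatives.

    Keeps each query's relevant docs plus its top-k highest-scoring
    non-relevant docs; everything else is assumed trivially dismissible
    by retrievers and is removed. Illustrative sketch only.
    """
    keep = set()
    for q in queries:
        # always retain the gold (relevant) documents
        keep.update(relevant[q])
        # rank non-relevant docs by baseline score; retain only the hardest
        negatives = [d for d in corpus if d not in relevant[q]]
        negatives.sort(key=lambda d: score(q, d), reverse=True)
        keep.update(negatives[:keep_top_k])
    # preserve original corpus order in the pruned result
    return [d for d in corpus if d in keep]
```

With a strong baseline scorer, this preserves the documents that actually confuse retrievers while cutting evaluation cost roughly in proportion to the corpus reduction.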
Radek Osmulski
NVIDIA
Gabriel de Souza P. Moreira
NVIDIA
Ronay Ak
NVIDIA
Mengyao Xu
NVIDIA
Benedikt Schifferer
NVIDIA
Even Oldridge
NVIDIA