REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

📅 2025-02-17
🤖 AI Summary
Existing multimodal document retrieval benchmarks fall short of real-world conditions: they under-test RAG systems on complex, table-heavy documents (e.g., financial reports) and on robustness to query rephrasing. This work introduces REAL-MM-RAG, an automatically generated, application-oriented multimodal document retrieval benchmark built around four properties essential for real-world retrieval: multi-modal documents, enhanced difficulty, realistic RAG-style queries, and accurate labeling. A multi-difficulty-level scheme based on query rephrasing evaluates models' semantic understanding beyond keyword matching. The benchmark exposes clear model weaknesses on table-heavy documents and rephrased queries; fine-tuning on a curated rephrased training set and a new finance-focused, table-heavy dataset yields state-of-the-art retrieval performance on the benchmark. The work is accompanied by training data and models that address these limitations.

📝 Abstract
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges in their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) realistic RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.
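The rephrasing-based difficulty scheme can be illustrated with a toy experiment: score the same retriever on progressively rephrased versions of one query and watch Recall@k drop as lexical overlap with the gold document disappears. This is only a sketch under assumed names (`recall_at_k`, the toy corpus, the level numbering), with a bag-of-words cosine retriever standing in for the multi-modal embedding models the paper actually evaluates.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(cnt * b[t] for t, cnt in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(queries, corpus, k=1):
    # queries: list of (query_text, gold_doc_id); corpus: {doc_id: text}.
    doc_vecs = {d: Counter(t.lower().split()) for d, t in corpus.items()}
    hits = 0
    for q, gold in queries:
        qv = Counter(q.lower().split())
        ranked = sorted(doc_vecs, key=lambda d: cosine(qv, doc_vecs[d]),
                        reverse=True)
        hits += gold in ranked[:k]
    return hits / len(queries)

# Toy corpus and three rephrase levels of the same information need:
# level 0 shares keywords with the gold document, level 2 avoids them.
corpus = {
    "doc_rev": "revenue by segment quarterly",
    "doc_hr": "hiring and headcount for each office",
}
levels = {
    0: [("quarterly revenue by segment", "doc_rev")],
    1: [("income per business segment each quarter", "doc_rev")],
    2: [("how much money did each unit bring in", "doc_rev")],
}
for lvl, qs in levels.items():
    print(f"level {lvl}: Recall@1 = {recall_at_k(qs, corpus, k=1):.2f}")
```

A keyword-matching retriever succeeds on levels 0 and 1 but fails on level 2, which is precisely the semantic-understanding gap the multi-difficulty scheme is designed to expose.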
Problem

Research questions and friction points this paper is trying to address.

Multi-modal document retrieval challenges
Real-world retrieval benchmark design
Enhancing model robustness and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically generated multi-modal benchmark
Multi-difficulty-level query rephrasing scheme
Finance-focused table-heavy dataset curation
Authors

Navve Wasserman (unknown affiliation)
Roi Pony (IBM Research Israel)
O. Naparstek (IBM Research Israel)
Adi Raz Goldfarb (IBM Research Israel)
Eli Schwartz (IBM Research Israel)
Udi Barzelay (IBM Research Israel)
Leonid Karlinsky (Principal Research Scientist, MIT-IBM Watson AI Lab, IBM Research)