🤖 AI Summary
Existing multimodal document retrieval benchmarks fall short of real-world conditions: they do not adequately stress RAG systems on complex documents (e.g., table-heavy financial reports) or test robustness to query rephrasing. This work introduces REAL-MM-RAG, an automatically generated, application-oriented multimodal document retrieval benchmark built around four properties essential for realistic retrieval: multi-modal documents, enhanced difficulty, realistic RAG-style queries, and accurate labeling. On top of the benchmark, the authors propose a multi-difficulty-level evaluation scheme based on query rephrasing, probing semantic understanding beyond keyword matching. Evaluation on REAL-MM-RAG exposes significant model weaknesses, particularly on table-heavy documents and rephrased queries. To address these gaps, the authors curate a rephrased training set and a new finance-focused, table-heavy dataset; fine-tuning on them yields state-of-the-art retrieval performance on the benchmark. The training data and resulting models are released alongside the benchmark.
📝 Abstract
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges in their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) realistic RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems, while also providing training data and models that address current limitations.
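The rephrasing-based difficulty scheme can be illustrated with a minimal sketch: score retrieval on near-verbatim queries versus heavily rephrased ones, and compare. Everything here is an illustrative stand-in — the toy corpus, the queries, the bag-of-words "embedding", and the `recall_at_k` helper are assumptions for demonstration, not the paper's actual data, encoder, or metrics.

```python
# Sketch of rephrasing-based difficulty levels for retrieval evaluation.
# A real setup would use a multimodal encoder over document pages; here a
# toy bag-of-words cosine similarity stands in to show the mechanics.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector (stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(queries, gold, docs, k=1):
    """Fraction of queries whose gold document ranks in the top k."""
    hits = 0
    for q, g in zip(queries, gold):
        qv = embed(q)
        ranked = sorted(docs, key=lambda d: cosine(qv, embed(docs[d])),
                        reverse=True)
        hits += g in ranked[:k]
    return hits / len(queries)

docs = {
    "d1": "quarterly revenue table for fiscal year 2023",
    "d2": "employee onboarding guide and hr policies",
}
gold = ["d1", "d2"]
# Level 0: near-verbatim queries with heavy keyword overlap.
level0 = ["revenue table fiscal 2023", "employee onboarding guide"]
# Level 2: rephrased queries preserving intent but sharing few keywords.
level2 = ["how much money did the company make last year",
          "steps for new hires"]

r0 = recall_at_k(level0, gold, docs)
r2 = recall_at_k(level2, gold, docs)
# A large drop from r0 to r2 signals keyword matching rather than
# semantic understanding — exactly what the difficulty levels expose.
```

A keyword-matching retriever scores well on level 0 but degrades sharply at higher rephrasing levels, which is the gap the benchmark's stratification is designed to surface.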