🤖 AI Summary
Existing multimodal document retrieval benchmarks fall short of real-world conditions: they do not adequately stress RAG systems on complex documents (e.g., table-heavy financial reports) or test robustness to query rephrasing. This work introduces REAL-MM-RAG, an automatically generated, application-oriented multimodal document retrieval benchmark built around four properties essential for realistic retrieval: multi-modal documents, enhanced difficulty, realistic RAG-style queries, and accurate labeling. On top of the benchmark, the authors propose a multi-difficulty-level evaluation scheme based on query rephrasing, probing semantic understanding beyond keyword matching. Evaluation on REAL-MM-RAG exposes significant model weaknesses, particularly on table-heavy documents and rephrased queries. To address these gaps, the authors curate a rephrased training set and a new finance-focused, table-heavy dataset; fine-tuning on them yields state-of-the-art retrieval performance on the benchmark. The training data and resulting models are released alongside the benchmark.
📝 Abstract
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges in their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) realistic RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems, while also providing training data and models that address current limitations.
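The rephrasing-based difficulty scheme can be illustrated with a minimal sketch: score retrieval on near-verbatim queries versus heavily rephrased ones, and compare. Everything here is an illustrative stand-in — the toy corpus, the queries, the bag-of-words "embedding", and the `recall_at_k` helper are assumptions for demonstration, not the paper's actual data, encoder, or metrics.

```python
# Sketch of rephrasing-based difficulty levels for retrieval evaluation.
# A real setup would use a multimodal encoder over document pages; here a
# toy bag-of-words cosine similarity stands in to show the mechanics.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector (stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(queries, gold, docs, k=1):
    """Fraction of queries whose gold document ranks in the top k."""
    hits = 0
    for q, g in zip(queries, gold):
        qv = embed(q)
        ranked = sorted(docs, key=lambda d: cosine(qv, embed(docs[d])),
                        reverse=True)
        hits += g in ranked[:k]
    return hits / len(queries)

docs = {
    "d1": "quarterly revenue table for fiscal year 2023",
    "d2": "employee onboarding guide and hr policies",
}
gold = ["d1", "d2"]
# Level 0: near-verbatim queries with heavy keyword overlap.
level0 = ["revenue table fiscal 2023", "employee onboarding guide"]
# Level 2: rephrased queries preserving intent but sharing few keywords.
level2 = ["how much money did the company make last year",
          "steps for new hires"]

r0 = recall_at_k(level0, gold, docs)
r2 = recall_at_k(level2, gold, docs)
# A large drop from r0 to r2 signals keyword matching rather than
# semantic understanding — exactly what the difficulty levels expose.
```

A keyword-matching retriever scores well on level 0 but degrades sharply at higher rephrasing levels, which is the gap the benchmark's stratification is designed to surface.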