🤖 AI Summary
Existing RAG evaluation frameworks focus predominantly on textual retrieval, rely on end-to-end black-box assessment, and lack fine-grained, attributable evaluation for multimodal scenarios or individual pipeline components.
Method: We introduce mmRAG, the first modular, multimodal RAG benchmark supporting text, tables, and knowledge graphs. It features a unified multimodal document transformation and a hierarchical relevance annotation framework, enabling decoupled, reproducible, and transparent evaluation of core components (e.g., retrieval and query routing). The benchmark integrates queries from six diverse QA datasets, combines human and semi-automated annotation, and includes multiple RAG baselines.
Results: Experiments uncover pervasive cross-modal retrieval biases and query-routing failures, yielding quantifiable insights for component-level optimization. mmRAG advances RAG evaluation from opaque, holistic assessment toward interpretable, white-box analysis, establishing a new standard for principled multimodal RAG evaluation.
📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multimodal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components (such as the accuracy of retrieval and query routing) beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.
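The abstract mentions deriving dataset-level relevance from document-level annotations. A minimal sketch of one plausible derivation, assuming (this is our illustrative assumption, not the paper's stated procedure) that a dataset's relevance to a query is the maximum graded relevance label among that dataset's annotated documents:

```python
from collections import defaultdict

def derive_dataset_relevance(doc_labels):
    """Hypothetical helper: aggregate document-level relevance labels
    into dataset-level relevance.

    doc_labels: iterable of (query_id, dataset_id, relevance) triples,
    where relevance is a graded label (0 = not relevant).
    Returns {query_id: {dataset_id: relevance}}, taking the max label
    over each dataset's documents (max-pooling is an assumption here).
    """
    out = defaultdict(dict)
    for query_id, dataset_id, rel in doc_labels:
        prev = out[query_id].get(dataset_id, 0)
        out[query_id][dataset_id] = max(prev, rel)
    return dict(out)

# Illustrative labels (dataset names are placeholders, not mmRAG's).
labels = [
    ("q1", "table_qa", 2),
    ("q1", "table_qa", 0),
    ("q1", "kg_qa", 1),
    ("q2", "text_qa", 3),
]
print(derive_dataset_relevance(labels))
# {'q1': {'table_qa': 2, 'kg_qa': 1}, 'q2': {'text_qa': 3}}
```

Dataset-level labels of this form are what make component-wise metrics possible: query-routing accuracy can be scored directly against them, independent of generation quality.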