mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG evaluation frameworks focus predominantly on textual retrieval, rely on end-to-end black-box assessment, and lack fine-grained, attributable evaluation for multimodal scenarios or individual pipeline components. Method: We introduce mmRAG, the first modular, multimodal RAG benchmark supporting text, tables, and knowledge graphs. It features a novel unified multimodal document transformation and a hierarchical relevance annotation framework, enabling decoupled, reproducible, and transparent evaluation of core components (e.g., retrieval, query routing). The benchmark integrates six diverse QA datasets, employs human and semi-automated annotation, and incorporates multiple RAG baselines. Results: Experiments uncover pervasive cross-modal retrieval biases and query-routing failures, providing quantifiable insights for component-level optimization. mmRAG moves RAG evaluation from opaque, holistic assessment toward interpretable, white-box analysis, establishing a standard for principled multimodal RAG evaluation.

📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components -- such as the accuracy of retrieval and query routing -- beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.
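The component-level evaluation the abstract describes, scoring retrieval against annotated document relevance and query routing against dataset relevance, can be illustrated with a small sketch. All function and variable names here are hypothetical; this assumes standard IR metrics (nDCG over graded qrels), not the paper's exact protocol:

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded relevance labels."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query: ranked_doc_ids is the retriever's ranking,
    qrels maps doc_id -> annotated graded relevance."""
    gains = [qrels.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    if not ideal or dcg(ideal) == 0:
        return 0.0
    return dcg(gains) / dcg(ideal)

def routing_accuracy(predicted_datasets, gold_datasets):
    """Fraction of queries routed to a dataset annotated as relevant."""
    hits = sum(1 for pred, gold in zip(predicted_datasets, gold_datasets)
               if pred in gold)
    return hits / len(predicted_datasets)

# One query's retrieval run scored against its relevance annotations
qrels = {"doc_3": 2, "doc_7": 1}            # annotated graded relevance
run = ["doc_3", "doc_1", "doc_9", "doc_7"]  # retriever's output ranking
score = ndcg_at_k(run, qrels, k=4)
```

Because relevance is annotated per document and per dataset, a retriever or router can be scored in isolation, without running the generator at all, which is the decoupling the benchmark advertises.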
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-modal RAG systems beyond text retrieval
Assessing individual RAG components like retrieval accuracy
Standardizing relevance annotation for diverse data formats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular benchmark for multi-modal RAG evaluation
Integrates queries over text, tables, and knowledge graphs
Annotates document relevance for granular component assessment
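The unified conversion of tables and knowledge-graph content into retrievable text documents, listed above as a contribution, could look roughly like the following. This is a minimal sketch with hypothetical serialization choices; the paper's actual transformation format is not specified here:

```python
def table_to_document(caption, header, rows):
    """Linearize a table into a plain-text document for indexing."""
    lines = [caption, " | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def kg_to_document(entity, triples):
    """Verbalize knowledge-graph triples about one entity into text."""
    facts = [f"{subj} {pred} {obj}." for subj, pred, obj in triples]
    return f"{entity}: " + " ".join(facts)

# Both modalities end up as strings in one shared retrieval corpus
corpus = [
    table_to_document("City populations",
                      ["City", "Population"],
                      [["Paris", 2102650], ["Lyon", 522250]]),
    kg_to_document("Paris",
                   [("Paris", "capital of", "France"),
                    ("Paris", "located on", "the Seine")]),
]
```

Once every modality is flattened into the same document form, a single retriever can be evaluated uniformly across text, table, and KG queries, which is what makes the cross-modal bias analysis possible.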