🤖 AI Summary
This paper addresses the lack of a unified evaluation framework for multimodal RAG systems operating across text, tables, and images, and across multiple documents. To this end, we propose FATHOMS-RAG, a comprehensive benchmark for multimodal RAG pipelines. Its core contributions are: (1) a hand-crafted set of 93 challenging questions covering multimodal alignment and cross-document reasoning; (2) a phrase-level recall metric for correctness and an embedding-based nearest-neighbor classifier for hallucination detection, both validated by independent human annotators; and (3) a comparative evaluation of pipelines built with open-source retrieval mechanisms against closed-source foundation models, enabling fine-grained diagnosis of multimodal understanding and hallucination. Experiments show that closed-source pipelines significantly outperform open-source counterparts in both answer correctness and hallucination mitigation, especially on cross-document and multimodal questions. Human annotators rated their agreement with the automated metrics highly (4.62/5 for correctness; 4.53/5 for hallucination detection).
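The paper's exact formulation of the phrase-level recall metric isn't spelled out here; a minimal sketch, assuming correctness is scored as the fraction of annotated gold key phrases that appear (after light normalization) in the generated answer, might look like this. The `normalize` helper and the matching rule are illustrative assumptions, not the paper's implementation:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so phrase matching is lenient."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())

def phrase_recall(answer: str, gold_phrases: list[str]) -> float:
    """Fraction of gold key phrases found in the answer (0.0 to 1.0)."""
    norm_answer = normalize(answer)
    hits = sum(1 for p in gold_phrases if normalize(p) in norm_answer)
    return hits / len(gold_phrases) if gold_phrases else 0.0
```

A phrase-level recall like this rewards answers that contain all required facts while remaining agnostic to answer length and phrasing around the key phrases.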
📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, assessing a pipeline's ability to ingest, retrieve, and reason over several modalities of information, which differentiates it from existing benchmarks that focus on particular components such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines on both correctness and hallucination metrics, with wider performance gaps on questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
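One way a nearest-neighbor embedding classifier for hallucination detection could work: embed each answer sentence and each retrieved context chunk, then flag answer sentences whose nearest context neighbor falls below a cosine-similarity threshold. The sketch below uses a toy bag-of-words embedding as a stand-in for a learned encoder; the `threshold` value and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def embed(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Toy bag-of-words unit vector; a real system would use a learned encoder."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def flag_hallucinations(answer_sents: list[str],
                        context_chunks: list[str],
                        threshold: float = 0.5) -> list[bool]:
    """Flag answer sentences whose nearest context chunk (by cosine
    similarity) scores below the threshold, i.e. lacks support."""
    words = {w for t in answer_sents + context_chunks for w in t.lower().split()}
    vocab = {w: i for i, w in enumerate(sorted(words))}
    ctx = np.stack([embed(c, vocab) for c in context_chunks])
    flags = []
    for sent in answer_sents:
        nearest_sim = float(np.max(ctx @ embed(sent, vocab)))
        flags.append(nearest_sim < threshold)
    return flags
```

The design intuition is that supported claims should sit close to some retrieved passage in embedding space, while fabricated content has no nearby neighbor in the context.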