FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of a unified evaluation framework for multimodal RAG systems operating across text, tables, and images, and across multiple documents. To this end, the authors propose FATHOMS-RAG, a comprehensive benchmark for multimodal RAG. Its core contributions are: (1) a hand-crafted set of 93 challenging questions covering multimodal alignment and cross-document reasoning; (2) a phrase-level recall metric for correctness and an embedding-based nearest-neighbor classifier for hallucination detection, both validated by independent human annotators; and (3) a comparative evaluation pipeline spanning open-source retrieval mechanisms and closed-source foundation models, enabling fine-grained diagnosis of multimodal understanding and hallucination. Experiments show that closed-source pipelines significantly outperform open-source counterparts in both correctness and hallucination metrics, with wider gaps on cross-document and multimodal questions. Third-party human evaluation shows strong agreement with the automated metrics (4.62/5 for correctness; 4.53/5 for hallucination detection).

📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark that evaluates RAG pipelines as a whole, assessing a pipeline's ability to ingest, retrieve, and reason over several modalities of information; this differentiates it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines on both correctness and hallucination metrics, with wider performance gaps on questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal RAG pipeline performance holistically
Assessing information retrieval across text, tables, and images
Measuring factual accuracy and hallucination in RAG systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

A benchmark evaluates multimodal RAG pipeline performance end to end
Phrase-level recall metric measures answer correctness
Nearest-neighbor classifier identifies potential pipeline hallucinations
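The two automated metrics above can be sketched as follows. This is a minimal illustration under assumed details, not the paper's implementation: phrase recall is approximated here as case-insensitive substring matching of gold key phrases, and the hallucination check uses cosine similarity against retrieved-context embeddings with a hypothetical threshold.

```python
def phrase_recall(gold_phrases, answer):
    """Fraction of gold key phrases found in the generated answer.
    Substring matching is a simplifying assumption for illustration."""
    if not gold_phrases:
        return 0.0
    answer_lc = answer.lower()
    hits = sum(1 for phrase in gold_phrases if phrase.lower() in answer_lc)
    return hits / len(gold_phrases)

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def flag_hallucination(answer_emb, context_embs, threshold=0.5):
    """Flag a potential hallucination when the answer embedding's nearest
    neighbor among retrieved-context embeddings falls below a similarity
    threshold (the threshold value here is hypothetical)."""
    best = max(cosine(answer_emb, ctx) for ctx in context_embs)
    return best < threshold
```

In practice the embeddings would come from a sentence-embedding model applied to the answer and the retrieved passages; the sketch only shows the nearest-neighbor decision rule itself.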
👥 Authors
Samuel Hildebrand
Louisiana State University, Baton Rouge, LA, USA
Curtis Taylor
Oak Ridge National Lab, Oak Ridge, TN, USA
Sean Oesch
Cybersecurity Researcher, Oak Ridge National Laboratory
James M Ghawaly Jr
Louisiana State University, Baton Rouge, LA, USA
Amir Sadovnik
Oak Ridge National Lab, Oak Ridge, TN, USA
Ryan Shivers
Oak Ridge National Lab, Oak Ridge, TN, USA
Brandon Schreiber
Oak Ridge National Lab, Oak Ridge, TN, USA
Kevin Kurian
PhD Student, University of Florida