🤖 AI Summary
This paper addresses the lack of a unified evaluation framework for multimodal RAG systems operating across text, tables, and images, and across multiple documents. To this end, we propose FATHOMS-RAG, a comprehensive benchmark for multimodal RAG pipelines. Its core contributions are: (1) a hand-crafted set of 93 challenging questions covering multimodal alignment and cross-document reasoning; (2) a phrase-level recall metric for correctness and an embedding-based nearest-neighbor classifier for hallucination detection, both validated by independent human annotators; and (3) a comparative evaluation of pipelines built with open-source retrieval mechanisms against closed-source foundation models, enabling fine-grained diagnosis of multimodal understanding and hallucination. Experiments show that closed-source pipelines significantly outperform open-source counterparts in both answer correctness and hallucination mitigation, especially on cross-document and multimodal questions. Human annotators rated their agreement with the automated metrics highly (4.62/5 for correctness; 4.53/5 for hallucination detection).
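The paper's exact formulation of the phrase-level recall metric isn't spelled out here; a minimal sketch, assuming correctness is scored as the fraction of annotated gold key phrases that appear (after light normalization) in the generated answer, might look like this. The `normalize` helper and the matching rule are illustrative assumptions, not the paper's implementation:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so phrase matching is lenient."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())

def phrase_recall(answer: str, gold_phrases: list[str]) -> float:
    """Fraction of gold key phrases found in the answer (0.0 to 1.0)."""
    norm_answer = normalize(answer)
    hits = sum(1 for p in gold_phrases if normalize(p) in norm_answer)
    return hits / len(gold_phrases) if gold_phrases else 0.0
```

A phrase-level recall like this rewards answers that contain all required facts while remaining agnostic to answer length and phrasing around the key phrases.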
📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, assessing a pipeline's ability to ingest, retrieve, and reason over several modalities of information, which differentiates it from existing benchmarks that focus on particular components such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines on both correctness and hallucination metrics, with wider performance gaps on questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
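One way a nearest-neighbor embedding classifier for hallucination detection could work: embed each answer sentence and each retrieved context chunk, then flag answer sentences whose nearest context neighbor falls below a cosine-similarity threshold. The sketch below uses a toy bag-of-words embedding as a stand-in for a learned encoder; the `threshold` value and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def embed(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Toy bag-of-words unit vector; a real system would use a learned encoder."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def flag_hallucinations(answer_sents: list[str],
                        context_chunks: list[str],
                        threshold: float = 0.5) -> list[bool]:
    """Flag answer sentences whose nearest context chunk (by cosine
    similarity) scores below the threshold, i.e. lacks support."""
    words = {w for t in answer_sents + context_chunks for w in t.lower().split()}
    vocab = {w: i for i, w in enumerate(sorted(words))}
    ctx = np.stack([embed(c, vocab) for c in context_chunks])
    flags = []
    for sent in answer_sents:
        nearest_sim = float(np.max(ctx @ embed(sent, vocab)))
        flags.append(nearest_sim < threshold)
    return flags
```

The design intuition is that supported claims should sit close to some retrieved passage in embedding space, while fabricated content has no nearby neighbor in the context.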