Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework

📅 2025-02-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal RAG benchmarks focus predominantly on simple image-text interactions and overlook the chart understanding and reasoning that real-world documents often require. To address this gap, we introduce Chart-MRAG, a chart-oriented multimodal retrieval-augmented generation task, and present Chart-MRAG Bench, an evaluation benchmark built from real-world chart documents that covers eight domains and 4,738 QA pairs. To ensure high-quality samples, we propose CHARGE, a semi-automatic framework that combines structured keypoint extraction, cross-modal verification, and keypoint-based generation, followed by expert validation. Empirical analysis reveals three limitations of current approaches on Chart-MRAG: unified multimodal embedding retrieval struggles on charts, multimodal large language models (MLLMs) show a consistent text-over-visual bias, and even with ground-truth retrieval, state-of-the-art MLLMs reach only 58.19% Correctness and 73.87% Coverage. Both the benchmark and the CHARGE framework are publicly released.

📝 Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, cross-modal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate a consistent text-over-visual modality bias during Chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
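
To make the idea of keypoint-based evaluation more concrete, the following is a minimal sketch of how a response could be scored against reference keypoints, split by the modality each keypoint comes from. It is an illustration under stated assumptions, not the paper's metric: the `Keypoint` schema, the `coverage` and `coverage_by_modality` functions, and the verbatim substring matching rule are all hypothetical, and the actual Correctness and Coverage definitions are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    """One atomic fact a reference answer should contain (hypothetical schema)."""
    text: str
    modality: str  # "text" or "chart": where the fact originates in the document


def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so matching ignores formatting differences.
    return " ".join(s.lower().split())


def coverage(response: str, keypoints: list[Keypoint]) -> float:
    """Fraction of keypoints whose text appears verbatim in the response.

    Assumption: substring matching stands in for whatever matching rule
    a real evaluator would use.
    """
    if not keypoints:
        return 0.0
    resp = normalize(response)
    hits = sum(1 for kp in keypoints if normalize(kp.text) in resp)
    return hits / len(keypoints)


def coverage_by_modality(response: str, keypoints: list[Keypoint]) -> dict[str, float]:
    """Coverage computed separately for each modality present in the keypoints."""
    modalities = sorted({kp.modality for kp in keypoints})
    return {
        m: coverage(response, [kp for kp in keypoints if kp.modality == m])
        for m in modalities
    }


# Toy data, invented for illustration only.
keypoints = [
    Keypoint("revenue grew 12%", "text"),
    Keypoint("peak in 2021", "chart"),
]
response = "Revenue grew 12% according to the report."

overall = coverage(response, keypoints)
per_modality = coverage_by_modality(response, keypoints)
print(overall, per_modality)
```

In this toy case the response repeats the text-derived fact and misses the chart-derived one, so the per-modality split shows the kind of gap that a text-over-visual bias would produce.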

Problem

Research questions and friction points this paper is trying to address.

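Existing multimodal RAG benchmarks cover simple image-text interactions but overlook charts

Chart-based document question answering lacks a dedicated MRAG task and benchmark

Unified multimodal embedding retrieval methods have not been evaluated on chart-based scenarios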
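
As a rough illustration of what evaluating embedding-based retrieval involves, the sketch below ranks a mixed corpus of text and chart items by cosine similarity to a query and reports recall at k. Everything in it is an assumption made for the example: the two-dimensional vectors are toy values rather than the output of a real embedding model, the item IDs are invented, and recall at k is a common retrieval measure rather than a confirmed part of the paper's protocol.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], corpus: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k corpus items most similar to the query."""
    ranked = sorted(corpus, key=lambda item_id: cosine(query, corpus[item_id]), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant items that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Toy embeddings, invented for illustration: text items cluster near the query,
# while the relevant chart item lies far from it in the shared space.
corpus = {
    "text_1": [1.0, 0.0],
    "chart_1": [0.0, 1.0],
    "text_2": [0.9, 0.1],
}
query = [1.0, 0.2]
relevant = {"text_1", "chart_1"}

retrieved = top_k(query, corpus, k=2)
score = recall_at_k(retrieved, relevant)
print(retrieved, score)
```

Here both retrieved items are text, so the relevant chart is missed and recall at 2 is 0.5. This only illustrates the failure pattern; it does not reproduce any result from the benchmark.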

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chart-based MRAG task

CHARGE question-answering generation framework

Cross-modal verification combined with expert validation

🔎 Similar Papers

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

2024-08-02 · arXiv.org · Citations: 10
