Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework

📅 2025-02-20

🤖 AI Summary
Existing multimodal RAG benchmarks focus predominantly on simple image-text interactions and overlook the chart understanding and reasoning that real-world documents often require. To address this gap, we introduce Chart-MRAG, a chart-oriented multimodal retrieval-augmented generation task, and present Chart-MRAG Bench, an evaluation benchmark built from real-world chart documents that covers eight domains and 4,738 QA pairs. To construct high-quality samples, we propose CHARGE, a semi-automatic framework that combines structured keypoint extraction, crossmodal verification, and keypoint-based generation, followed by expert validation. The evaluation exposes three limitations of current approaches on Chart-MRAG: unified multimodal embedding retrieval struggles on charts, multimodal large language models (MLLMs) show a consistent text-over-visual bias, and even with ground-truth retrieval, state-of-the-art MLLMs reach only 58.19% Correctness and 73.87% Coverage. Both the benchmark and the CHARGE framework are publicly released.

📝 Abstract
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate a consistent text-over-visual modality bias during chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
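
The three stages named in the abstract (structured keypoint extraction, crossmodal verification, keypoint-based generation) can be pictured with a toy sketch. This is not the CHARGE implementation: the `Keypoint` class, the function names, and the verification rule (dropping a chart keypoint whose value conflicts with the text) are all hypothetical stand-ins chosen only to show how data might flow between the stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    """Hypothetical unit of information taken from one modality."""
    modality: str  # "text" or "chart"
    subject: str   # what the fact is about
    value: str     # the stated value


def extract_keypoints(text_facts, chart_points):
    """Stage 1 stand-in: wrap already-parsed (subject, value) pairs as keypoints."""
    keypoints = [Keypoint("text", s, v) for s, v in text_facts]
    keypoints += [Keypoint("chart", s, v) for s, v in chart_points]
    return keypoints


def verify_crossmodal(keypoints):
    """Stage 2 stand-in (assumed rule): drop chart keypoints that contradict the text."""
    text_values = {k.subject: k.value for k in keypoints if k.modality == "text"}
    return [
        k for k in keypoints
        if k.modality == "text" or text_values.get(k.subject, k.value) == k.value
    ]


def generate_qa(keypoints):
    """Stage 3 stand-in: turn each surviving keypoint into one QA pair."""
    return [
        {
            "question": f"What is the value of {k.subject}?",
            "answer": k.value,
            "modality": k.modality,
        }
        for k in keypoints
    ]


# Toy inputs: the chart disagrees with the text on one subject.
text_facts = [("2023 revenue", "5M")]
chart_points = [("2023 revenue", "6M"), ("2024 revenue", "7M")]

verified = verify_crossmodal(extract_keypoints(text_facts, chart_points))
qa_pairs = generate_qa(verified)
```

In this toy run, the conflicting chart keypoint is filtered out and the remaining two keypoints each yield one QA pair. In the actual benchmark, the abstract adds expert validation on top of the generated samples.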
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in multimodal RAG benchmarks
Focuses on chart-based document question-answering
Evaluates multimodal embedding retrieval methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chart-based MRAG task
CHARGE generation framework
Crossmodal verification integration
👥 Authors
Yuming Yang, Fudan University (Natural Language Processing, Large Language Models)
Jiang Zhong, College of Computer Science, Chongqing University, China
Li Jin, Aerospace Information Research Institute, Chinese Academy of Sciences, China
Jingwang Huang, College of Computer Science, Chongqing University, China
Jingpeng Gao, College of Computer Science, Chongqing University, China
Qing Liu, Aerospace Information Research Institute, Chinese Academy of Sciences, China
Yang Bai, Aerospace Information Research Institute, Chinese Academy of Sciences, China
Jingyuan Zhang, Kuaishou Technology, Beijing, China
Rui Jiang, Tsinghua University (Bioinformatics)
Kaiwen Wei, College of Computer Science, Chongqing University, China