FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing financial RAG research is confined to unimodal text, neglecting critical visual information in financial reports—such as charts and tables—and thereby producing inaccurate analysis. To address this, we introduce FinRAGBench-V, the first multimodal RAG benchmark for finance. The contributions comprise: (1) a bilingual, human-annotated, cross-modal dataset covering seven domain-specific question-answering tasks, with explicit visual grounding support; (2) a multimodal RAG evaluation paradigm incorporating visual citations, integrating OCR-enhanced and layout-aware document parsing, vision–language alignment modeling, and RGenCite, a generative citation baseline; and (3) a rigorous evaluation metric for visual citation precision. Experiments reveal substantial deficiencies in visual citation capability among mainstream multimodal large language models (MLLMs); in contrast, RGenCite significantly enhances interpretability and traceability. FinRAGBench-V establishes a new standard for trustworthy, visually grounded reasoning in financial AI.

📝 Abstract
Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents and thus losing key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance, which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, a RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.
Problem

Research questions and friction points this paper is trying to address.

Lack of visual data integration in financial RAG systems
Need for traceable visual citations in financial analysis
Absence of standardized benchmarks for multimodal financial RAG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates multimodal data with visual citation
Introduces RGenCite for visual citation integration
Proposes automatic citation evaluation for MLLMs
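
As an illustration of the kind of automatic citation evaluation the Innovation bullets describe, the sketch below scores a model's cited evidence pages against human-annotated gold pages. This is a hypothetical helper for intuition only — the function name and the set-based precision/recall formulation are assumptions, not the paper's actual RGenCite protocol.

```python
def citation_precision_recall(cited_pages, gold_pages):
    """Hypothetical page-level citation score: compare the pages a model
    cites against annotated evidence pages (not the paper's exact metric)."""
    cited, gold = set(cited_pages), set(gold_pages)
    if not cited or not gold:
        return 0.0, 0.0
    hits = len(cited & gold)  # pages that are both cited and gold evidence
    return hits / len(cited), hits / len(gold)  # (precision, recall)
```

A benchmark would average such scores over the QA dataset; finer-grained variants could score bounding-box citations within a page rather than whole pages.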
Suifeng Zhao
Key Laboratory of High Confidence Software Technologies, School of Computer Sciences, Peking University
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing · Knowledge Engineering
Sujian Li
State Key Laboratory of Multimedia Information Processing, School of Computer Sciences, Peking University
Jun Gao
Key Laboratory of High Confidence Software Technologies, School of Computer Sciences, Peking University