FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal multi-hop reasoning in multimodal large language models (MLLMs) remains insufficiently evaluated, primarily due to benchmark limitations—data contamination, too few modalities (fewer than three), and the absence of explicit multi-hop constraints. Method: We introduce FCMR, the first financial-domain, trimodal (text/table/chart) multi-hop reasoning benchmark, featuring three key innovations: (1) contamination-resistant construction from real financial reports; (2) genuine trimodality with explicit multi-hop requirements, where Hard-level tasks mandate full modality integration and prohibit omitting any modality; and (3) fine-grained error attribution analysis, supported by human verification and adversarial validation to ensure query complexity and annotation robustness. Results: Experiments reveal severe limitations: even the strongest model, Claude 3.5 Sonnet, achieves only 30.4% accuracy on Hard tasks. Information retrieval is identified as the critical bottleneck, exposing fundamental weaknesses in current MLLMs' cross-modal reasoning capabilities.

📝 Abstract
Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels (Easy, Medium, and Hard), facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.
Problem

Research questions and friction points this paper is trying to address.

Evaluate multi-hop reasoning in MLLMs
Address data contamination in benchmarks
Assess cross-modal integration in financial domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Financial Cross-Modal Multi-Hop Reasoning (FCMR) benchmark
Three-tier difficulty structure (Easy, Medium, Hard)
Analysis identifying information retrieval as the bottleneck