FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal multi-hop reasoning in multimodal large language models (MLLMs) remains insufficiently evaluated, primarily due to benchmark limitations—data contamination, too few modalities (fewer than three), and the absence of explicit multi-hop constraints. Method: We introduce FCMR, the first financial-domain, trimodal (text/table/chart) multi-hop reasoning benchmark, featuring three key innovations: (1) contamination-resistant construction from real financial reports; (2) genuine trimodality with explicit multi-hop requirements, where Hard-level tasks mandate full modality integration and prohibit omitting any modality; and (3) fine-grained error attribution analysis, supported by human verification and adversarial validation to ensure query complexity and annotation robustness. Results: Experiments reveal severe limitations: even the strongest model, Claude 3.5 Sonnet, achieves only 30.4% accuracy on Hard tasks. Information retrieval is identified as the critical bottleneck, exposing fundamental weaknesses in current MLLMs' cross-modal reasoning capabilities.

📝 Abstract
Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels (Easy, Medium, and Hard), facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.
Problem

Research questions and friction points this paper is trying to address.

Evaluate multi-hop reasoning in MLLMs
Address data contamination in benchmarks
Assess cross-modal integration in financial domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Financial Cross-Modal Multi-Hop Reasoning (FCMR) benchmark
Three-tier difficulty structure (Easy, Medium, Hard)
Analysis identifying information retrieval as the bottleneck