FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

๐Ÿ“… 2024-10-06
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 4
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the lack of comprehensive evaluation of large language models (LLMs) on complex reasoning in multilingual, multimodal financial question answering. It introduces FAMMA, a domain-specific financial benchmark covering eight subfields, three languages (English, Chinese, French), and multimodal inputs such as charts and tables. For rigorous assessment, the authors propose FAMMA-LivePro, a contamination-free evaluation subset whose answers and rationales are held out from the public, and release a large-scale, human-annotated dataset of financial reasoning trajectories. Methodologically, the work combines multimodal data construction, expert annotation, trajectory distillation from DeepSeek-R1, and fine-tuning of Qwen-series models under a controlled evaluation protocol. Experiments reveal significant deficiencies of state-of-the-art models, including reasoning models such as GPT-o1 and DeepSeek-R1, on financial multimodal reasoning. In contrast, trajectory-fine-tuned Qwen models achieve substantial gains on FAMMA-LivePro, empirically validating reasoning-trajectory training for domain-specific reasoning.

๐Ÿ“ Abstract
In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of large language models (LLMs) to answer complex reasoning questions that require advanced financial knowledge. The benchmark has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales held out from the public for contamination-free evaluation. The questions cover advanced knowledge of 8 major subfields in finance (e.g., corporate finance, derivatives, and portfolio management). Most are in English, with the remainder in Chinese or French. Each question includes non-text data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data and fine-tuned a series of open-source Qwen models on this reasoning data. We found that training a model on these reasoning trajectories significantly improves its performance on FAMMA-LivePro. We released our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on financial multilingual multimodal QA tasks
Assessing complex reasoning with advanced financial knowledge
Challenging models with non-text data in finance questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual multimodal financial QA benchmark
Human-expert-curated advanced finance questions
Fine-tuned open-source models on distilled reasoning trajectories
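The trajectory-distillation recipe described above (collect DeepSeek-R1 rationales on FAMMA-Basic, then fine-tune Qwen models on them) can be sketched as a data-preparation step. This is a minimal illustration only: the field names (`question`, `image`, `rationale`, `answer`) and the `<think>` target format are assumptions, not the released pipeline's actual schema.

```python
# Hypothetical sketch: packaging a FAMMA-Basic question plus a distilled
# reasoning trajectory into a chat-style supervised fine-tuning record.
# All field names and the <think> formatting are illustrative assumptions.

def build_sft_record(example: dict) -> dict:
    """Turn one annotated QA example into a fine-tuning sample."""
    user_content = example["question"]
    if example.get("image"):
        # Multimodal inputs (charts, diagrams, tables) are marked with a
        # placeholder; a real pipeline would feed pixel data to a
        # vision-language model alongside the text.
        user_content = "<image>\n" + user_content
    # The target interleaves the distilled rationale with the final answer,
    # so the student model learns to produce the trajectory before answering.
    target = f"<think>\n{example['rationale']}\n</think>\n{example['answer']}"
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": target},
        ]
    }

example = {
    "question": "Given the payoff diagram, which option strategy is shown?",
    "image": "chart_042.png",
    "rationale": "The payoff is capped on the upside and floored below...",
    "answer": "A bull call spread.",
}
record = build_sft_record(example)
```

A list of such records matches the conversational format that common SFT tooling expects, so the same structure works whether the targets are plain answers or full reasoning trajectories.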