AI Summary
This work addresses the lack of comprehensive evaluation of large language models' (LLMs') complex reasoning in multilingual, multimodal financial question answering. We introduce FAMMA, the first domain-specific financial benchmark covering eight subdomains, three languages (Chinese, English, and French), and multimodal inputs including charts and tables. To ensure rigorous assessment, we propose LivePro, a contamination-isolated evaluation subset, and release a large-scale, human-annotated dataset of financial reasoning trajectories. Methodologically, we integrate multimodal data construction, expert annotation, trajectory distillation, and fine-tuning of Qwen-series models under a controlled evaluation protocol. Experiments reveal significant deficiencies in state-of-the-art LLMs (e.g., GPT-4o, DeepSeek-R1) on financial multimodal reasoning. In contrast, trajectory-augmented Qwen models achieve substantial gains on FAMMA-LivePro, empirically validating reasoning-trajectory enhancement for domain-specific reasoning.
Abstract
In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of large language models (LLMs) to answer complex reasoning questions that require advanced financial knowledge. The benchmark has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales held out from the public for contamination-free evaluation. These questions cover advanced knowledge of 8 major subfields in finance (e.g., corporate finance, derivatives, and portfolio management). Some are in Chinese or French, though the majority are in English. Each question includes non-text data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data, and fine-tuned a series of open-source Qwen models on this reasoning data. We found that training a model on these reasoning trajectories can significantly improve its performance on FAMMA-LivePro. We released our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.
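The trajectory-distillation step described above can be sketched in a few lines: take a FAMMA-Basic question together with a curated DeepSeek-R1 reasoning trajectory and pack them into a chat-style example for supervised fine-tuning. This is a minimal illustration only; the record fields (`qid`, `subfield`, `image_paths`, etc.) and the SFT message format are assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for a FAMMA-style question; the released
# dataset's actual field names may differ.
@dataclass
class FammaQuestion:
    qid: str
    subfield: str                 # one of the 8 finance subfields, e.g. "derivatives"
    language: str                 # "en", "zh", or "fr"
    text: str                     # question stem
    image_paths: List[str] = field(default_factory=list)  # charts, diagrams, tables
    answer: str = ""              # held out for FAMMA-LivePro questions

def to_sft_example(q: FammaQuestion, trajectory: str) -> dict:
    """Pack a question and a curated reasoning trajectory into a
    chat-style fine-tuning example (assumed format)."""
    return {
        "messages": [
            {"role": "user", "content": q.text},
            {"role": "assistant", "content": f"{trajectory}\nAnswer: {q.answer}"},
        ]
    }

q = FammaQuestion(
    "basic-0001", "derivatives", "en",
    "Given the payoff diagram shown, identify the option strategy.",
    ["q0001_chart.png"], answer="long straddle",
)
ex = to_sft_example(q, "The payoff is V-shaped around the strike, gaining on large moves in either direction...")
```

In practice the image inputs would be attached via the fine-tuning framework's multimodal message format; here they are carried only as paths for simplicity.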