MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing financial LLM evaluation benchmarks are largely monolingual, single-modality, and built on oversimplified tasks, so they fail to reflect the cross-lingual and multimodal complexity of real-world financial scenarios. To address this, we propose MultiFinBen, the first difficulty-aware, multilingual (e.g., English, Spanish), and multimodal (text, vision, audio) benchmark for global finance. Our contributions include: (1) novel cross-lingual question answering (PolyFiQA-Easy/Expert) and OCR-augmented document understanding tasks; (2) a dynamic difficulty-aware item selection mechanism and a unified multimodal fusion evaluation framework; and (3) multilingual alignment modeling and OCR-text joint reasoning techniques. Extensive experiments across 22 state-of-the-art models reveal significant performance degradation (up to 38%) on cross-lingual multimodal financial tasks, highlighting critical gaps in current capabilities. MultiFinBen is publicly released to foster fair, reproducible, and inclusive advancement of financial AI.

📝 Abstract

Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two sets of novel tasks: PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than a simple aggregation of existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.

Problem

Research questions and friction points this paper is trying to address.

Evaluating financial LLMs across multilingual and multimodal settings

Assessing model performance on complex reasoning with mixed-language inputs

Challenging models with OCR-embedded financial document understanding tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual multimodal benchmark for finance

Dynamic difficulty-aware task selection

OCR-embedded financial QA tasks
