🤖 AI Summary
This study addresses the limited multimodal comprehension capabilities of vision-language models on non-English, high-stakes financial documents, such as French investment prospectuses, by introducing Multimodal Finance Eval, the first multimodal benchmark tailored to the French financial domain. The benchmark comprises 1,204 expert-validated questions spanning text extraction, table understanding, chart interpretation, and multi-turn dialogue reasoning. Using an LLM-as-judge protocol, the authors systematically evaluate six open-source models ranging from 8B to 124B parameters. Results show strong performance (85–90% accuracy) on textual and tabular tasks but significantly lower scores (34–62%) on chart understanding. Moreover, in multi-turn dialogues, accuracy drops sharply to around 50% due to error propagation, irrespective of model scale, revealing a critical vulnerability of current models in high-stakes financial analysis.
📝 Abstract
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B–124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85–90% accuracy), they struggle with chart interpretation (34–62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
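The LLM-as-judge protocol described above can be sketched as a simple scoring loop. The prompt wording, the `judge_stub` containment check, and the example questions below are illustrative assumptions, not the paper's actual judge model or prompt:

```python
# Minimal sketch of an LLM-as-judge evaluation loop (illustrative only).
# In the real protocol, judge_stub would be replaced by a call to a judge LLM.

JUDGE_PROMPT = (
    "Question: {q}\n"
    "Reference answer: {ref}\n"
    "Model answer: {pred}\n"
    "Reply CORRECT if the model answer matches the reference, else INCORRECT."
)

def judge_stub(prompt: str) -> str:
    """Stand-in for a judge-LLM call: naive reference-containment check."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    ref, pred = fields["Reference answer"], fields["Model answer"]
    return "CORRECT" if ref.lower() in pred.lower() else "INCORRECT"

def accuracy(examples, judge=judge_stub) -> float:
    """Score each (question, reference, prediction) triple with the judge."""
    verdicts = [
        judge(JUDGE_PROMPT.format(q=q, ref=ref, pred=pred)) == "CORRECT"
        for q, ref, pred in examples
    ]
    return sum(verdicts) / len(verdicts)

# Hypothetical extraction items in the style of a KID document.
examples = [
    ("What is the SRI risk level?", "4", "The SRI level is 4."),
    ("What are the entry costs?", "2.5%", "Entry costs are 1.0%."),
]
print(accuracy(examples))  # 0.5
```

The design choice here is that the judge sees the question, the reference, and the model output together, so it can accept paraphrased answers that exact-match scoring would reject.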