🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal perception, comprehension, and reasoning capabilities of large vision-language models (LVLMs) in Chinese financial contexts. To bridge this gap, we propose CFMME, the first comprehensive multimodal benchmark tailored to the Chinese financial domain, encompassing eight image modalities, four core tasks, and 6,052 real-world business instances. We conduct a systematic assessment of prominent LVLMs across multimodal question answering, object detection, text recognition, and information extraction tasks. Experimental results reveal that even the best-performing model achieves only 66.11% accuracy on question answering, with an average score of 77.18 across the remaining tasks, highlighting substantial room for improvement in handling complex financial scenarios.
📝 Abstract
The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.