🤖 AI Summary
This work addresses the absence of a multimodal evaluation benchmark for financial credit assessment that simultaneously ensures privacy compliance and practical utility. To this end, we propose FCMBench-V1.0, the first multimodal benchmark tailored to financial credit scenarios. It features a synthetically generated yet realistically captured dataset comprising 4,043 privacy-compliant images across 18 document types and 8,446 question-answer pairs. The benchmark encompasses three perception tasks, four credit-reasoning tasks, and ten robustness evaluations, including real-world perturbations. Among the 23 prominent vision-language models evaluated, Qfin-VL-Instruct achieves the highest F1 score of 64.92, while Gemini 3 Pro (64.61) and Qwen3-VL-235B (57.27) lead among commercial and open-source models, respectively. Notably, all models exhibit significant performance degradation under perturbations, underscoring the need for rigorous and realistic evaluation.
📝 Abstract
As multimodal AI is increasingly used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects the documents and workflows specific to financial credit applications, (2) evaluates credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 -- a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework spans three dimensions: Perception, Reasoning, and Robustness, comprising 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench effectively discriminates performance and robustness disparities across modern vision-language models. We conduct extensive experiments on 23 state-of-the-art vision-language models (VLMs) from 14 leading AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 (%) score among commercial models (64.61), Qwen3-VL-235B achieves the best score among open-source baselines (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
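The leaderboard numbers above are F1 (%) scores. The abstract does not specify the exact scoring protocol, so as a hedged illustration only, here is a minimal sketch of how an F1 score over QA predictions is typically computed from true-positive, false-positive, and false-negative counts; the counts used below are purely hypothetical:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = harmonic mean of precision and recall.

    Note: this is a generic formula, not FCMBench's (unpublished here)
    evaluation code; tp/fp/fn counts are placeholders.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical per-model counts aggregated over QA samples.
print(round(100 * f1_score(tp=520, fp=210, fn=310), 2))  # → 66.67
```

Equivalently, F1 = 2·tp / (2·tp + fp + fn), which makes clear why a model degraded by acquisition artifacts (more false negatives on perturbed images) sees its score drop.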