FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications

📅 2026-01-01
🏛️ arXiv.org
🤖 AI Summary
This work addresses the absence of a multimodal evaluation benchmark for financial credit assessment that simultaneously ensures privacy compliance and practical utility. To this end, we propose FCMBench-V1.0, the first multimodal benchmark tailored to financial credit scenarios. It features a synthetically generated yet realistically captured dataset comprising 4,043 privacy-compliant images across 18 document types and 8,446 question-answer pairs. The benchmark covers three perception tasks, four credit-reasoning tasks, and ten robustness evaluations that include real-world acquisition perturbations. Among the 23 prominent vision-language models evaluated, Qfin-VL-Instruct achieves the highest F1 score of 64.92, while Gemini 3 Pro (64.61) and Qwen3-VL-235B (57.27) lead among commercial and open-source models, respectively. Notably, all models exhibit significant performance degradation under perturbations, underscoring the need for rigorous, realistic evaluation.

📝 Abstract
As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 -- a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 (%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
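The abstract reports a single F1 (%) score per model but does not spell out the metric's construction here. As a minimal sketch, assuming a document field-extraction setting in which a model may abstain from answering some fields (so precision and recall can differ), a per-sample F1 might be computed as follows; the field names and abstention behavior are illustrative assumptions, not details from the paper.

```python
def f1_score(gold: dict, pred: dict) -> float:
    """F1 over extracted fields.

    gold: ground-truth {field: value} pairs for one document.
    pred: model predictions; fields the model abstained on are omitted,
          so precision (over predicted fields) and recall (over gold
          fields) can diverge.
    """
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative example (hypothetical fields): the model answers two of
# three gold fields correctly and abstains on the third.
gold = {"name": "A. Sample", "amount": "1200", "date": "2025-03-01"}
pred = {"name": "A. Sample", "amount": "1200"}
print(round(f1_score(gold, pred) * 100, 2))  # precision 1.0, recall 2/3
```

A benchmark-level score would then aggregate such per-sample values (e.g., micro- or macro-averaged) across the 8,446 QA samples; the paper's exact aggregation is not stated in this page.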
Problem

Research questions and friction points this paper is trying to address.

financial credit
multimodal benchmark
privacy compliance
real-world robustness
credit risk assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal benchmark
financial credit assessment
privacy-compliant synthesis
vision-language model evaluation
robustness to acquisition artifacts
Yehui Yang
Baidu, Bytedance
Computer Vision; Multimodal machine learning related applications

Dalu Yang
Baidu Inc
Computer Vision; Medical Image Analysis

Wenshuo Zhou
AI Lab, Qifu Technology, Beijing, China

Fangxin Shang
AI Lab, Qifu Technology, Beijing, China

Yifan Liu
Shanghai Jiao Tong University
Data Mining

Jie Ren
College of Future Information Technology, Fudan University, Shanghai, China

Haojun Fei
AI Lab, Qifu Technology, Beijing, China

Qing Yang
AI Lab, Qifu Technology, Beijing, China

Yanwu Xu
South China University of Technology, Baidu, CVTE, I2R, NTU, USTC
Ophthalmic Image Analysis; Medical Image Analysis; Medical Intelligence; Healthcare Data Analysis

Tao Chen
Fudan University
Deep Learning; Medical Image Segmentation