Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing general-purpose capabilities with domain-specific performance in multimodal large language models (MLLMs), this paper introduces the Qianfan-VL series of vision-language foundation models. Methodologically, we propose a domain-enhanced training paradigm integrated with a long-chain reasoning mechanism, implemented via multi-stage progressive training, high-fidelity synthetic data construction, and a large-scale, highly efficient training framework optimized for the Kunlun P800 AI chip. This enables joint optimization of broad generalization and specialized competencies—including OCR, document understanding, and mathematical/logical reasoning. Experiments demonstrate state-of-the-art results: 94.75% accuracy on DocVQA, 78.6% on MathVista, and consistent leadership across CC-Bench, SEED-Bench-IMG, and ScienceQA. Scalability is validated with >90% weak scaling efficiency at the thousand-GPU scale. Our core contribution lies in the first principled integration of domain enhancement and long-chain reasoning, enabling robust, enterprise-grade deployment across diverse multimodal applications.
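The ">90% weak scaling efficiency" claim can be made concrete: weak scaling efficiency compares per-device throughput at scale against single-device throughput. A minimal sketch, with made-up throughput numbers that are purely illustrative (the paper reports only the >90% figure at the thousand-chip scale):

```python
def weak_scaling_efficiency(total_throughput: float,
                            n_devices: int,
                            single_device_throughput: float) -> float:
    """Weak scaling efficiency: per-device throughput at N devices,
    relative to the throughput of a single device."""
    per_device = total_throughput / n_devices
    return per_device / single_device_throughput

# Hypothetical numbers: one chip sustains 1,000 tokens/s; 5,000 chips
# together sustain 4,600,000 tokens/s.
eff = weak_scaling_efficiency(4_600_000, 5000, 1000)
print(f"{eff:.0%}")  # prints "92%"
```

Any efficiency above 0.9 by this measure corresponds to the paper's ">90%" claim.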

📝 Abstract
We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu's Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.
Problem

Research questions and friction points this paper is trying to address.

Enhancing domain-specific capabilities while maintaining general performance in vision-language models
Developing multimodal models with superior OCR, document understanding, and mathematical reasoning
Validating large-scale AI infrastructure for training state-of-the-art multimodal models efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage progressive training methodology
High-precision data synthesis pipelines
Domain enhancement strategy for OCR
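The multi-stage progressive training idea can be sketched as a staged schedule in which successive stages unfreeze more of the model and shift the data mix toward domain data. The stage names, module names, and sampling weights below are assumptions for illustration, not details taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    trainable: set   # modules whose parameters are updated in this stage
    data_mix: dict   # dataset name -> sampling weight

# Illustrative three-stage progressive schedule (hypothetical).
SCHEDULE = [
    Stage("align", {"projector"},
          {"image_caption": 1.0}),
    Stage("general_pretrain", {"projector", "llm"},
          {"interleaved": 0.6, "ocr_synth": 0.2, "image_caption": 0.2}),
    Stage("domain_enhance", {"projector", "llm", "vision_encoder"},
          {"ocr_synth": 0.4, "doc_qa": 0.3, "long_cot_math": 0.3}),
]

def frozen_modules(stage: Stage,
                   all_modules=("vision_encoder", "projector", "llm")):
    """Modules kept frozen (no gradient updates) in a given stage."""
    return [m for m in all_modules if m not in stage.trainable]

for s in SCHEDULE:
    print(s.name, "frozen:", frozen_modules(s))
```

In a real training framework, `frozen_modules` would map to toggling gradient computation per module before each stage begins; the progressive unfreezing pattern is a common practice for vision-language alignment, used here only to illustrate the concept.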
Daxiang Dong
Baidu
Deep Learning, Natural Language Processing, Data Mining
Mingming Zheng
Dong Xu
Bairong Zhuang
Wenyu Zhang
Chunhua Luo
Haoran Wang
Zijian Zhao
Jie Li
Yuxuan Li
Hanjun Zhong
Mengyue Liu
Jieting Chen
Shupeng Li
Lun Tian
Yaping Feng
Xin Li
Donggang Jiang
Yong Chen
Yehua Xu
Duohao Qin
Chen Feng
Dan Wang
Henghua Zhang
Jingjing Ha