PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low efficiency and high computational cost of recognizing complex layout elements (e.g., text, tables, formulas, and charts) in multilingual document parsing, this paper proposes PaddleOCR-VL, built around the ultra-compact 0.9B vision-language model PaddleOCR-VL-0.9B. The model integrates a NaViT-style dynamic-resolution visual encoder with the lightweight ERNIE-4.5-0.3B language model, augmented by multi-task learning and knowledge distillation, to achieve efficient multimodal understanding under strict parameter constraints. It supports 109 languages and achieves state-of-the-art performance on both page-level document parsing and element-level recognition. It significantly outperforms existing methods on public and in-house benchmarks while offering fast inference and low GPU memory consumption, which makes it well suited to large-scale deployment in real-world scenarios.

📝 Abstract
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Developing a compact vision-language model for multilingual document parsing
Recognizing complex elements like text, tables, formulas and charts
Achieving fast inference with minimal resource consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ultra-compact 0.9B vision-language model for document parsing
NaViT-style dynamic-resolution visual encoder paired with the ERNIE-4.5-0.3B language model
Supports 109 languages with minimal resource consumption
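
To illustrate what "dynamic resolution" means in practice, here is a minimal sketch of how a NaViT-style encoder might turn images of different sizes into different numbers of patch tokens instead of resizing everything to one fixed square. The function name `patch_grid`, the patch size of 14, and the cap of 1024 patches are all hypothetical values chosen for illustration; they are not taken from the paper and do not describe PaddleOCR-VL's actual implementation.

```python
import math


def patch_grid(height, width, patch=14, max_patches=1024):
    """Return (rows, cols) of the patch grid for an image kept near native size.

    Assumptions (not from the paper): square patches of `patch` pixels and a
    hypothetical budget of `max_patches` tokens per image.
    """
    # Number of patches needed to cover the image at its native resolution.
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)

    # If the image exceeds the token budget, shrink both sides by the same
    # factor so the aspect ratio is roughly preserved.
    if rows * cols > max_patches:
        scale = math.sqrt(max_patches / (rows * cols))
        rows = min(max(1, math.floor(rows * scale)), max_patches)
        cols = max(1, min(math.floor(cols * scale), max_patches // rows))
    return rows, cols


if __name__ == "__main__":
    # A square thumbnail, a wide text line, and a large page scan each get
    # a different token count.
    for h, w in [(224, 224), (28, 140), (1400, 2800)]:
        r, c = patch_grid(h, w)
        print(f"{h}x{w} -> grid {r}x{c}, {r * c} tokens")
```

A small wide crop such as a single text line keeps its shape and costs only a few tokens, while a full page is capped at the budget. That is the general property that makes native-resolution encoders attractive for documents with mixed element sizes.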
Authors
Cheng Cui, BUAA (deep learning, network design, OCR, mllm)
Ting Sun, PaddlePaddle Team, Baidu Inc.
Suyin Liang, PaddlePaddle Team, Baidu Inc.
Tingquan Gao, PaddlePaddle Team, Baidu Inc.
Zelun Zhang, PaddlePaddle Team, Baidu Inc.
Jiaxuan Liu, University of Science and Technology of China (Text-to-Speech, Speech LLM, AGI)
Xueqing Wang, PaddlePaddle Team, Baidu Inc.
Changda Zhou, PaddlePaddle Team, Baidu Inc.
Hongen Liu, PaddlePaddle Team, Baidu Inc.
Manhui Lin, PaddlePaddle Team, Baidu Inc.
Yue Zhang, PaddlePaddle Team, Baidu Inc.
Yubo Zhang, PaddlePaddle Team, Baidu Inc.
Handong Zheng, Unknown affiliation
Jing Zhang, PaddlePaddle Team, Baidu Inc.
Jun Zhang, PaddlePaddle Team, Baidu Inc.
Yi Liu, PaddlePaddle Team, Baidu Inc.
Dianhai Yu, Baidu (Deep Learning, Natural Language Processing, Machine Learning, Artificial intelligence)
Yanjun Ma, PaddlePaddle Team, Baidu Inc.