PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in current large-model OCR systems: high computational overhead, inaccurate text localization in complex layouts, and text hallucination, all symptoms of an excessive reliance on model scale. The authors propose PP-OCRv5, a lightweight OCR system with only 5 million parameters that deliberately departs from the "bigger-is-better" paradigm. By leveraging a data-driven strategy that emphasizes high-quality, high-difficulty, and highly diverse training data, PP-OCRv5 achieves performance on par with billion-parameter vision-language models on standard OCR benchmarks. The system retains a traditional yet efficient two-stage pipeline (text detection followed by text recognition) and incorporates systematic optimizations along three data dimensions: difficulty, accuracy, and diversity. Extensive evaluations demonstrate superior text localization accuracy and significantly reduced hallucination rates across multiple benchmarks.
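The two-stage pipeline described above can be sketched abstractly. The stage names, interfaces, and toy logic below are illustrative stand-ins, not PP-OCRv5's models or PaddleOCR's actual API:

```python
# Illustrative sketch of a two-stage OCR pipeline: detection first, then
# recognition on each detected region. Both stages are toy stand-ins.

def detect_text_regions(image):
    """Stage 1: return (position, region) pairs for text-bearing regions.
    Here 'image' is a list of text rows standing in for pixel data."""
    return [(i, row) for i, row in enumerate(image) if row.strip()]

def recognize_text(region):
    """Stage 2: transcribe one detected region (toy: strip whitespace)."""
    _, row = region
    return row.strip()

def ocr_pipeline(image):
    """Chain the stages, keeping localization info alongside each transcript."""
    results = []
    for region in detect_text_regions(image):
        results.append((region[0], recognize_text(region)))
    return results

page = ["Hello world", "", "PP-OCRv5"]
print(ocr_pipeline(page))  # [(0, 'Hello world'), (2, 'PP-OCRv5')]
```

The point of the structure, as the summary notes, is that keeping detection explicit preserves precise localization, whereas unified VLM decoders emit text without grounded box coordinates.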

📝 Abstract
The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
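The three data dimensions the abstract quantifies (difficulty, accuracy, diversity) can be illustrated with a toy curation filter. The scoring fields, thresholds, and caps below are hypothetical, not the paper's actual criteria:

```python
# Toy data-curation filter over the three dimensions named in the abstract:
# label accuracy, diversity, and difficulty. All field names, thresholds,
# and caps are hypothetical illustrations, not values from the paper.

def curate(samples, min_accuracy=0.9, max_per_style=2):
    """Keep accurately labeled samples, cap near-duplicates per style
    (a crude diversity control), and report the share of hard samples."""
    kept, per_style = [], {}
    for s in samples:
        if s["label_accuracy"] < min_accuracy:   # accuracy dimension
            continue
        n = per_style.get(s["style"], 0)
        if n >= max_per_style:                   # diversity dimension
            continue
        per_style[s["style"]] = n + 1
        kept.append(s)
    # difficulty dimension: fraction of retained samples that are "hard"
    hard_ratio = sum(s["difficulty"] > 0.5 for s in kept) / max(len(kept), 1)
    return kept, hard_ratio

samples = [
    {"style": "print", "label_accuracy": 0.99, "difficulty": 0.2},
    {"style": "print", "label_accuracy": 0.95, "difficulty": 0.7},
    {"style": "print", "label_accuracy": 0.97, "difficulty": 0.3},  # over style cap
    {"style": "scene", "label_accuracy": 0.50, "difficulty": 0.9},  # noisy label
    {"style": "scene", "label_accuracy": 0.92, "difficulty": 0.8},
]
kept, hard_ratio = curate(samples)
print(len(kept), hard_ratio)  # 3 samples kept; 2/3 of them are hard
```

The abstract's claim is that tuning these dimensions, rather than scaling the model, is what raises the ceiling of the two-stage pipeline.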
Problem

Research questions and friction points this paper is trying to address.

OCR
vision-language models
text localization
hallucination
lightweight model
Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight OCR
data-centric optimization
text localization
hallucination reduction
two-stage OCR pipeline
Cheng Cui
BUAA
deep learning, network design, OCR, MLLM
Yubo Zhang
PaddlePaddle Team, Baidu Inc.
Ting Sun
PaddlePaddle Team, Baidu Inc.
Xueqing Wang
PaddlePaddle Team, Baidu Inc.
Hongen Liu
PaddlePaddle Team, Baidu Inc.
Manhui Lin
PaddlePaddle Team, Baidu Inc.
Yue Zhang
PaddlePaddle Team, Baidu Inc.
Tingquan Gao
PaddlePaddle Team, Baidu Inc.
Changda Zhou
PaddlePaddle Team, Baidu Inc.
Jiaxuan Liu
University of Science and Technology of China
Text-to-Speech, Speech LLM, AGI
Zelun Zhang
PaddlePaddle Team, Baidu Inc.
Jing Zhang
PaddlePaddle Team, Baidu Inc.
Jun Zhang
PaddlePaddle Team, Baidu Inc.
Yi Liu
Baidu Inc.
CV, LLM, VLM