MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation benchmarks for document parsing under multilingual, multiscript, and real-world capture conditions, particularly for low-resource languages and non-ideal images. The authors introduce the first high-quality benchmark tailored to these challenging scenarios, covering 17 languages, diverse writing systems, and complex imaging conditions. A hybrid annotation pipeline that combines expert-model assistance with rigorous human verification ensures data fidelity, while a public/private split mitigates data-leakage risk. Evaluation results show that closed-source models (e.g., Gemini3-Pro) remain relatively robust, whereas open-source alternatives degrade significantly (average drops of 14.0% on non-Latin scripts and 17.8% on photographed documents), highlighting critical gaps in the linguistic inclusivity and real-world deployment readiness of current systems.
📝 Abstract
We introduce the Multilingual Document Parsing Benchmark (MDPBench), the first benchmark for parsing multilingual digital and photographed documents. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert-model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) are relatively robust, open-source alternatives suffer a dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with average drops of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source code is available at https://github.com/Yuliang-Liu/MultimodalOCR.
Problem

Research questions and friction points this paper is trying to address.

multilingual document parsing
real-world document images
low-resource languages
non-Latin scripts
document parsing benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual document parsing
real-world document benchmark
non-Latin scripts
photographed document analysis
model evaluation
👥 Authors
Zhang Li
Huazhong University of Science and Technology
Zhibo Lin
Huazhong University of Science and Technology
Qiang Liu
Kingsoft Office
Ziyang Zhang
Huazhong University of Science and Technology
Shuo Zhang
Huazhong University of Science and Technology
Zidun Guo
Huazhong University of Science and Technology
Jiajun Song
Michigan Technological University
Jiarui Zhang
Kingsoft Office
Xiang Bai
Huazhong University of Science and Technology
Yuliang Liu
Huazhong University of Science and Technology