MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

📅 2025-11-13
🤖 AI Summary
Real-world documents frequently contain multi-level tables, embedded images or formulas, and cross-page structures, which existing OCR systems handle poorly. To address this, MonkeyOCR v1.5 proposes a unified vision-language framework with a two-stage parsing pipeline: (1) a first stage in which a large multimodal model jointly predicts document layout and reading order; (2) a second stage that performs localized recognition of text, formulas, and tables within the detected regions. For complex tables, a visual consistency-based reinforcement learning scheme scores recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, handle tables with embedded images and tables that span pages or columns. The method achieves state-of-the-art performance on OmniDocBench v1.5, outperforming PPOCR-VL and MinerU 2.5, and is particularly robust on visually complex documents.

📝 Abstract
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
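The two-stage flow described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Region` type, the `parse_page` function, and the idea of passing the layout model and the recognizers in as plain callables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    """One detected block (hypothetical structure, not from the paper)."""
    kind: str   # "text", "formula", or "table"
    bbox: BBox
    order: int  # position in the predicted reading order


def parse_page(
    page: object,
    detect_layout: Callable[[object], List[Region]],
    recognizers: Dict[str, Callable[[object, BBox], str]],
) -> List[str]:
    """Stage 1 predicts layout and reading order; stage 2 recognizes each region."""
    # Stage 1: a single call returns both the regions and their reading order.
    regions = detect_layout(page)

    # Stage 2: recognize each region in isolation, so a mistake in one
    # region cannot corrupt the output of another.
    outputs = []
    for region in sorted(regions, key=lambda r: r.order):
        recognize = recognizers[region.kind]
        outputs.append(recognize(page, region.bbox))
    return outputs
```

Because the recognizers only see one region at a time, swapping in a different table or formula recognizer does not affect the layout stage.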
Problem

Research questions and friction points this paper is trying to address.

Addressing complex document layouts with multi-level tables and embedded elements
Improving OCR accuracy for cross-page structures and visual consistency
Enhancing table parsing with embedded images and multi-column reconstruction
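Reconstructing a table that is split across pages or columns amounts to deciding how two fragments relate and then joining them accordingly. The sketch below illustrates that idea only. The three merge-type labels (`repeated_header`, `continuation`, `split_row`) and the `merge_fragments` function are hypothetical names chosen for this example, and the merge type is taken as an input rather than predicted, since this summary does not describe how the paper determines it.

```python
from typing import List

Table = List[List[str]]  # rows of cell strings


def merge_fragments(first: Table, second: Table, merge_type: str) -> Table:
    """Join two table fragments according to an externally supplied merge type."""
    if not first or not second:
        # Nothing to merge: return a copy of whichever fragment is non-empty.
        return [row[:] for row in (first or second)]

    if merge_type == "repeated_header":
        # The second fragment starts with a copy of the header row: drop it.
        return [r[:] for r in first] + [r[:] for r in second[1:]]

    if merge_type == "continuation":
        # The second fragment simply continues the body rows.
        return [r[:] for r in first] + [r[:] for r in second]

    if merge_type == "split_row":
        # One logical row was cut in two: join it cell by cell.
        if len(first[-1]) != len(second[0]):
            raise ValueError("split rows must have the same number of cells")
        joined = [
            " ".join(part for part in (a, b) if part)
            for a, b in zip(first[-1], second[0])
        ]
        return [r[:] for r in first[:-1]] + [joined] + [r[:] for r in second[1:]]

    raise ValueError(f"unknown merge type: {merge_type}")
```

For example, merging `[["Item", "Qty"], ["apple", "1"]]` with `[["Item", "Qty"], ["pear", "2"]]` under `repeated_header` keeps a single header row followed by both body rows.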
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage parsing pipeline for layout and content
Reinforcement learning with render-and-compare alignment
Specialized modules for embedded images and table merging
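The render-and-compare idea in the list above can be illustrated with a toy reward: render the predicted table structure to an image, then score how well it overlaps the source image. This is only a sketch under stated assumptions. The border-only renderer, the intersection-over-union score, and the names `render_table` and `visual_reward` are inventions for the example; this summary does not specify the renderer or similarity measure the paper uses.

```python
from typing import List, Tuple

Cell = Tuple[int, int, int, int]  # (row, col, rowspan, colspan)
Image = List[List[int]]           # binary raster, 1 = ink


def render_table(cells: List[Cell], n_rows: int, n_cols: int, size: int = 4) -> Image:
    """Draw each cell's border on a blank raster (toy stand-in for a real renderer)."""
    height, width = n_rows * size + 1, n_cols * size + 1
    img = [[0] * width for _ in range(height)]
    for r, c, rs, cs in cells:
        top, left = r * size, c * size
        bottom, right = (r + rs) * size, (c + cs) * size
        for x in range(left, right + 1):
            img[top][x] = img[bottom][x] = 1
        for y in range(top, bottom + 1):
            img[y][left] = img[y][right] = 1
    return img


def visual_reward(rendered: Image, source: Image) -> float:
    """Intersection-over-union of ink pixels, in [0, 1]."""
    inter = union = 0
    for row_a, row_b in zip(rendered, source):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 1.0


# A 2x2 table stands in for the binarized source crop.
source = render_table([(0, 0, 1, 1), (0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1)], 2, 2)

# A wrong prediction that merges the two top cells loses one border segment.
wrong = render_table([(0, 0, 1, 2), (1, 0, 1, 1), (1, 1, 1, 1)], 2, 2)

print(visual_reward(source, source))  # identical rendering scores 1.0
print(visual_reward(wrong, source))   # the structural error lowers the score
```

The appeal of this kind of signal is that it needs only the source image, not a hand-labeled table structure, which matches the abstract's claim of improving structural accuracy without manual annotations.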
Authors

Jiarui Zhang
KingSoft Office Zhuiguang AI Lab

Yuliang Liu
Huazhong University of Science and Technology

Zijun Wu
University of Alberta
Natural Language Processing (NLP)

Guosheng Pang
KingSoft Office Zhuiguang AI Lab

Zhili Ye
KingSoft Office Zhuiguang AI Lab

Yupei Zhong
KingSoft Office Zhuiguang AI Lab

Junteng Ma
KingSoft Office Zhuiguang AI Lab

Tao Wei
KingSoft Office Zhuiguang AI Lab

Haiyang Xu
KingSoft Office Zhuiguang AI Lab

Weikai Chen
Principal Research Scientist, Tencent America
3D AIGC, 3D Vision, Computer graphics, VLM

Zeen Wang
KingSoft Office Zhuiguang AI Lab

Qiangjun Ji
KingSoft Office Zhuiguang AI Lab

Fanxi Zhou
KingSoft Office Zhuiguang AI Lab

Qi Zhang
KingSoft Office Zhuiguang AI Lab

Yuanrui Hu
KingSoft Office Zhuiguang AI Lab

Jiahao Liu
KingSoft Office Zhuiguang AI Lab

Zhang Li
Huazhong University of Science and Technology

Ziyang Zhang
Huazhong University of Science and Technology

Qiang Liu
KingSoft Office Zhuiguang AI Lab

Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR