🤖 AI Summary
Real-world documents frequently contain multi-level tables, embedded images and formulas, and cross-page structures, posing significant challenges for existing OCR systems. To address this, we propose a unified vision-language framework built on a two-stage parsing pipeline: (1) a first stage in which a large multimodal model jointly predicts document layout and reading order, leveraging visual information for structural and sequential consistency; (2) a second stage that performs localized, region-wise recognition of text, formulas, and tables, preserving visual fidelity while limiting error propagation. A visual consistency-based reinforcement learning scheme evaluates and refines table recognition via render-and-compare alignment, and two dedicated modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, handle tables with embedded images and tables spanning pages or columns. Our method achieves state-of-the-art performance on OmniDocBench v1.5, outperforming PPOCR-VL and MinerU 2.5, with particular gains on visually complex documents.
📝 Abstract
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
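The abstract describes a render-and-compare reward: a predicted table is rendered back to an image and scored against the original crop, so recognition quality can be judged without manual annotations. The paper text here does not specify the exact scoring function, so the following is a minimal sketch under an assumed pixel-level agreement metric (the function name `visual_consistency_reward` and the toy images are illustrative, not from the paper):

```python
import numpy as np

def visual_consistency_reward(rendered: np.ndarray, target: np.ndarray) -> float:
    """Score how closely a rendered prediction matches the source crop.

    Both inputs are grayscale uint8 images in [0, 255]; higher is better,
    with 1.0 meaning a pixel-perfect match. A real system would first
    rasterize the predicted HTML/LaTeX table and align/resize the images;
    this sketch assumes that has been done and shapes already agree.
    """
    if rendered.shape != target.shape:
        raise ValueError("rendered prediction and target crop must share a shape")
    # Normalized mean absolute pixel difference, inverted into a reward.
    diff = np.abs(rendered.astype(np.float32) - target.astype(np.float32)) / 255.0
    return float(1.0 - diff.mean())

# Toy example: a blank white crop vs. a render with extra black table rules.
crop = np.full((8, 8), 255, dtype=np.uint8)
render = crop.copy()
render[2:4, 2:6] = 0  # 8 of 64 pixels differ completely
reward_same = visual_consistency_reward(crop, crop)    # 1.0
reward_diff = visual_consistency_reward(render, crop)  # 1 - 8/64 = 0.875
```

In an RL setup such a score would serve as the reward signal for the recognizer; in practice a structure-aware similarity (e.g. SSIM or cell-level matching) would be more robust than raw pixel differences.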