🤖 AI Summary
Real-world documents frequently contain multi-level tables, embedded images and formulas, and cross-page structures, posing significant challenges for existing OCR systems. To address this, we propose a unified vision-language framework built on a two-stage parsing pipeline: (1) a first stage in which a large multimodal model jointly predicts document layout and reading order, leveraging visual information for structural and sequential consistency; (2) a second stage that performs localized, region-wise recognition of text, formulas, and tables, preserving visual fidelity while limiting error propagation. A visual consistency-based reinforcement learning scheme evaluates and refines table recognition via render-and-compare alignment, and two dedicated modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, handle tables with embedded images and tables spanning pages or columns. Our method achieves state-of-the-art performance on OmniDocBench v1.5, outperforming PPOCR-VL and MinerU 2.5, with particular gains on visually complex documents.
📝 Abstract
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
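The abstract describes a render-and-compare reward: a predicted table is rendered back to an image and scored against the original crop, so recognition quality can be judged without manual annotations. The paper text here does not specify the exact scoring function, so the following is a minimal sketch under an assumed pixel-level agreement metric (the function name `visual_consistency_reward` and the toy images are illustrative, not from the paper):

```python
import numpy as np

def visual_consistency_reward(rendered: np.ndarray, target: np.ndarray) -> float:
    """Score how closely a rendered prediction matches the source crop.

    Both inputs are grayscale uint8 images in [0, 255]; higher is better,
    with 1.0 meaning a pixel-perfect match. A real system would first
    rasterize the predicted HTML/LaTeX table and align/resize the images;
    this sketch assumes that has been done and shapes already agree.
    """
    if rendered.shape != target.shape:
        raise ValueError("rendered prediction and target crop must share a shape")
    # Normalized mean absolute pixel difference, inverted into a reward.
    diff = np.abs(rendered.astype(np.float32) - target.astype(np.float32)) / 255.0
    return float(1.0 - diff.mean())

# Toy example: a blank white crop vs. a render with extra black table rules.
crop = np.full((8, 8), 255, dtype=np.uint8)
render = crop.copy()
render[2:4, 2:6] = 0  # 8 of 64 pixels differ completely
reward_same = visual_consistency_reward(crop, crop)    # 1.0
reward_diff = visual_consistency_reward(render, crop)  # 1 - 8/64 = 0.875
```

In an RL setup such a score would serve as the reward signal for the recognizer; in practice a structure-aware similarity (e.g. SSIM or cell-level matching) would be more robust than raw pixel differences.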