MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing OCR systems based on vision-language models (VLMs) struggle to reconstruct document-level logical structure, such as paragraphs and tables split across pages, which limits downstream tasks like retrieval-augmented generation (RAG). This work proposes MinerU-Popo, a lightweight, general-purpose OCR post-processing framework that transforms page-level OCR outputs into coherent document-level structures through four subtasks: text truncation recovery, table truncation recovery, heading hierarchy reconstruction, and figure-text association. The approach combines a task-oriented data engine, dynamic chunking with overlap-based synchronization, and a tree-based document representation, and uses a fine-tuned Qwen3-VL-4B model for inference. Experiments show that MinerU-Popo improves heading-hierarchy TEDS by at least 20% across all five tested OCR models, while also improving RAG accuracy and reducing per-query latency.
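To make the tree-based document representation mentioned above more concrete, here is a minimal sketch of how a flat list of headings with predicted levels could be assembled into a tree. The `Node` class, the `build_tree` function, and the `(level, title)` input format are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: a heading plus its nested sub-headings."""
    title: str
    level: int
    children: List["Node"] = field(default_factory=list)


def build_tree(headings: List[Tuple[int, str]]) -> Node:
    """Assemble (level, title) pairs, given in reading order, into a tree.

    Assumption: levels are integers >= 1, where 1 is a top-level heading.
    """
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently added heading
    for level, title in headings:
        if level < 1:
            raise ValueError("heading levels must be >= 1")
        # Pop until the top of the stack is a strict ancestor of the new heading.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Example input, made up for illustration.
root = build_tree([(1, "Intro"), (2, "Background"), (2, "Scope"), (1, "Method")])
```

In this sketch a heading becomes a child of the nearest preceding heading with a smaller level, so skipped levels (for example a level-3 heading directly under a level-1 heading) still attach to a parent.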

📝 Abstract

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show that MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy, and reduces per-query latency.
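The idea of chunking with overlap-based synchronization can be illustrated with a small sketch. It assumes fixed-size windows (the paper's chunking is dynamic) and a simple offset rule for reconciling title levels between neighboring chunks. The function names `overlapping_chunks` and `merge_levels`, and the offset rule itself, are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def overlapping_chunks(n_items: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) windows over n_items elements.

    Each window after the first shares `overlap` elements with the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    if n_items <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + size, n_items)
        windows.append((start, end))
        if end >= n_items:
            break
        start = end - overlap  # step back so the next window overlaps this one
    return windows


def merge_levels(chunk_outputs: List[Dict[int, int]]) -> Dict[int, int]:
    """Merge per-chunk title levels (element index -> level) into one mapping.

    Assumption: levels are only consistent relative to each other within a
    chunk, so each new chunk is shifted by the offset observed on an element
    it shares with the already merged result.
    """
    merged: Dict[int, int] = {}
    for levels in chunk_outputs:
        shared = [i for i in levels if i in merged]
        # Anchor on the first shared element; without overlap, keep levels as-is.
        offset = merged[shared[0]] - levels[shared[0]] if shared else 0
        for idx, level in levels.items():
            merged.setdefault(idx, level + offset)  # earlier chunks take priority
    return merged


windows = overlapping_chunks(10, size=4, overlap=1)  # [(0, 4), (3, 7), (6, 10)]
# Made-up model outputs for two chunks that share element 3.
merged = merge_levels([{0: 1, 2: 2, 3: 3}, {3: 1, 5: 2, 6: 1}])
```

Here the second chunk sees element 3 as a top-level title because it lacks earlier context; the shared element reveals an offset of 2, which is then applied to the rest of that chunk.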

Problem

Research questions and friction points this paper is trying to address.

document parsing

cross-page continuity

structure recovery

OCR post-processing

logical structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-processing

document-level parsing

cross-page structure recovery