MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in current document parsing methods, which exhibit consistent failure patterns on challenging samples—suggesting that performance bottlenecks stem from deficiencies in training data rather than model architecture. To overcome this, the authors propose a data-centric parsing paradigm that retains a fixed 1.2B-parameter model while constructing a large-scale training set characterized by high coverage, rich information content, and precise annotations. They introduce a three-stage progressive training strategy: large-scale pretraining, hard-sample fine-tuning, and GRPO-based alignment. Key innovations include diversity- and difficulty-aware sampling, cross-model consistency validation, and an iterative render-and-verify labeling pipeline. Evaluated on OmniDocBench v1.6, the approach achieves a score of 95.69, surpassing the same-architecture baseline by 2.71 points and outperforming the previous state-of-the-art method despite using a model with over 200 times fewer parameters.
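The diversity- and difficulty-aware sampling described above can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the cluster labels, difficulty scores, and weighting scheme below are illustrative assumptions, standing in for whatever document-type clustering and difficulty estimation the authors use.

```python
import random
from collections import defaultdict

def diversity_difficulty_sample(samples, k, seed=0):
    """Toy sketch: pick k samples by cycling over clusters (coverage)
    and weighting within each cluster by difficulty (hard-sample bias).
    Each sample is a dict with 'id', 'cluster', and 'difficulty' in [0, 1]."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for s in samples:
        by_cluster[s["cluster"]].append(s)
    clusters = sorted(by_cluster)
    picked, i = [], 0
    while len(picked) < k and any(by_cluster.values()):
        pool = by_cluster[clusters[i % len(clusters)]]
        i += 1
        if not pool:
            continue
        # weight harder samples more heavily within the cluster;
        # the 0.1 floor keeps easy samples from vanishing entirely
        weights = [0.1 + s["difficulty"] for s in pool]
        choice = rng.choices(range(len(pool)), weights=weights)[0]
        picked.append(pool.pop(choice))
    return picked
```

The round-robin over clusters guarantees coverage of rare document types even when one cluster dominates the raw pool, while the difficulty weighting skews each cluster's picks toward its hard samples.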
📝 Abstract
Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
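The Cross-Model Consistency Verification idea can be sketched in a few lines: run several heterogeneous parsers on the same page, score their pairwise agreement, and treat low agreement as a difficulty signal. The similarity metric and threshold below are placeholder assumptions (the paper does not specify them here); `difflib.SequenceMatcher` stands in for whatever matching the authors actually use.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_difficulty(outputs):
    """Toy sketch: mean pairwise string similarity among the outputs
    of heterogeneous parsers for one page; low agreement ~ hard sample.
    Returns (agreement, difficulty proxy), both in [0, 1]."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0, 1.0
    agreement = sum(SequenceMatcher(None, a, b).ratio()
                    for a, b in pairs) / len(pairs)
    return agreement, 1.0 - agreement

def triage(samples, hard_threshold=0.3):
    """Route pages: high-agreement outputs can serve as pseudo-labels,
    low-agreement pages get flagged for a refine/re-annotation pass."""
    easy, hard = [], []
    for page_id, outputs in samples.items():
        _, difficulty = consistency_difficulty(outputs)
        (hard if difficulty > hard_threshold else easy).append(page_id)
    return easy, hard
```

Pages where independent models converge are cheap sources of reliable annotations; pages where they diverge are exactly the hard samples the fine-tuning and judge-and-refine stages target.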
Problem

Research questions and friction points this paper is trying to address.

document parsing
training data deficiency
data-centric learning
hard samples
performance bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

data-centric learning
document parsing
hard sample mining
cross-model consistency verification
progressive training strategy
Bin Wang
Pengcheng Laboratory
Cloud Computing, IIoT, Green Computing, Computer Architecture
Tianyao He
Shanghai Jiao Tong University
Computer Vision
Linke Ouyang
Shanghai Artificial Intelligence Laboratory
Fan Wu
Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Wireless Networking, Mobile Computing, Algorithmic Game Theory and Its Applications
Zhiyuan Zhao
Shanghai Artificial Intelligence Laboratory
Tao Chu
SCUT
Yuan Qu
Shanghai Artificial Intelligence Laboratory
Zhenjiang Jin
Shanghai Artificial Intelligence Laboratory
Weijun Zeng
Shanghai Artificial Intelligence Laboratory
Ziyang Miao
Beihang University
Bangrui Xu
Shanghai Artificial Intelligence Laboratory
Junbo Niu
Peking University
Foundation Model
Mengzhang Cai
Shanghai Artificial Intelligence Laboratory
Jiantao Qiu
EE Department, Tsinghua University
Qintong Zhang
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Dongsheng Ma
Shanghai Artificial Intelligence Laboratory
Yuefeng Sun
Shanghai Artificial Intelligence Laboratory
Hejun Dong
Shanghai Artificial Intelligence Laboratory
Wenzheng Zhang
Rutgers University
Natural Language Processing, Deep Learning
Jutao Xiao
Shanghai Artificial Intelligence Laboratory
Jiayong Shi
Shanghai Artificial Intelligence Laboratory
Pengyu Liao
Shanghai Artificial Intelligence Laboratory
Xiaomeng Zhao
Shanghai Artificial Intelligence Laboratory
Huaping Zhong
SenseTime Group Limited
Liqun Wei
Shanghai Artificial Intelligence Laboratory