MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

📅 2025-09-26
🤖 AI Summary
High-resolution document parsing faces a fundamental trade-off between fine-grained recognition (dense text, mathematical formulas, tables) and computational efficiency. Method: We propose a coarse-to-fine, two-stage decoupled vision-language model. Stage 1 performs global layout analysis on downsampled low-resolution images; Stage 2 performs fine-grained content recognition of text, formulas, and tables within original-resolution local regions located by the Stage 1 layout. Crucially, layout and content modeling are disentangled, enabling downsampling for acceleration while preserving high-fidelity detail in the targeted regions. The framework integrates vision-language joint modeling, synthetic data augmentation, and adaptive region cropping. Results: The method achieves a Pareto-optimal accuracy-efficiency trade-off, outperforming both general-purpose and domain-specific models on multiple standard benchmarks at significantly lower inference cost, and it generalizes well across diverse document types and layouts.

📝 Abstract
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
Problem

Research questions and friction points this paper is trying to address.

Efficient high-resolution document parsing with a decoupled vision-language model
Two-stage strategy separates layout analysis from content recognition
Achieves state-of-the-art accuracy while reducing computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage parsing decouples layout and content
Layout analysis on downsampled images for efficiency
Native-resolution crops preserve fine-grained details
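The two-stage flow described above can be sketched in a few lines. This is a minimal illustration of the control flow only: `analyze_layout` and `recognize_region` are hypothetical stubs standing in for MinerU2.5's actual VLM inference, and the downsampling is represented by a scale factor rather than real image resizing.

```python
# Sketch of the coarse-to-fine, two-stage parsing strategy (assumed structure,
# not MinerU2.5's actual API). Stage 1 runs on a downsampled view of the page;
# Stage 2 recognizes each detected region at native resolution.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    kind: str                            # "text" | "formula" | "table"
    bbox: Tuple[int, int, int, int]      # (x0, y0, x1, y1) in downsampled coords

def analyze_layout(low_res_page: str) -> List[Region]:
    """Stage 1 (stub): global layout analysis on the downsampled page."""
    return [Region("text", (0, 0, 50, 10)), Region("table", (0, 20, 50, 40))]

def recognize_region(native_bbox: Tuple[int, int, int, int], kind: str) -> str:
    """Stage 2 (stub): fine-grained recognition on a native-resolution crop."""
    return f"<{kind} parsed from crop {native_bbox}>"

def parse_page(page: str, scale: int = 4) -> List[Tuple[str, str]]:
    # Stage 1: cheap layout analysis on a 1/scale downsampled view.
    low_res = f"{page}@1/{scale}"        # placeholder for real image resizing
    regions = analyze_layout(low_res)
    # Stage 2: project each box back to native resolution, then recognize it.
    results = []
    for r in regions:
        x0, y0, x1, y1 = (c * scale for c in r.bbox)
        results.append((r.kind, recognize_region((x0, y0, x1, y1), r.kind)))
    return results

print(parse_page("page_001.png"))
```

The key design point the sketch captures is the decoupling: the expensive high-resolution pixels are touched only inside the small crops selected by the cheap low-resolution layout pass, rather than across the whole page.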
👥 Authors
Junbo Niu
Peking University
Zheng Liu
Shanghai Artificial Intelligence Laboratory, Peking University
Zhuangcheng Gu
Shanghai Artificial Intelligence Laboratory
Bin Wang
Shanghai Artificial Intelligence Laboratory
Linke Ouyang
Shanghai Artificial Intelligence Laboratory
Zhiyuan Zhao
Shanghai Artificial Intelligence Laboratory
Tao Chu
SCUT
Tianyao He
Shanghai Jiao Tong University
Fan Wu
Shanghai Artificial Intelligence Laboratory
Qintong Zhang
Shanghai Artificial Intelligence Laboratory, Peking University
Zhenjiang Jin
Shanghai Artificial Intelligence Laboratory
Guang Liang
Nanjing University
Rui Zhang
Shanghai Artificial Intelligence Laboratory
Wenzheng Zhang
Rutgers University
Yuan Qu
Shanghai Artificial Intelligence Laboratory
Zhifei Ren
Shanghai Artificial Intelligence Laboratory
Yuefeng Sun
Shanghai Artificial Intelligence Laboratory
Yuanhong Zheng
Shanghai Artificial Intelligence Laboratory
Dongsheng Ma
Shanghai Artificial Intelligence Laboratory
Zirui Tang
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Boyu Niu
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Ziyang Miao
Beihang University
Hejun Dong
Shanghai Artificial Intelligence Laboratory
Siyi Qian
Shanghai Artificial Intelligence Laboratory, Peking University
Junyuan Zhang
Shanghai Artificial Intelligence Laboratory