PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges of low data quality and inefficient inference in multimodal understanding of Chinese business documents, this paper proposes a synergistic optimization framework for high-quality synthetic data generation and efficient visual reasoning. First, a statistical filtering mechanism, guided by large-model-based evaluation, automatically removes low-quality synthetic samples. Second, ViT intermediate-layer representations are decoupled, and a lightweight cross-layer feature fusion module is introduced to enhance fine-grained semantic modeling. Third, quantization-aware inference is integrated to reduce computational overhead. Evaluated on a newly constructed Chinese business document benchmark, the proposed method achieves an 11.4% accuracy improvement and a 73.0% reduction in inference latency over state-of-the-art baselines. This work establishes a scalable technical pathway toward high-accuracy, low-latency intelligent parsing of Chinese business documents.
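The summary's first component, statistical filtering of synthetic samples, can be sketched as follows. This is a minimal illustration only: the scoring model, score scale, and deviation threshold `k` are all assumptions, since the paper states only that a statistical criterion over large-model quality scores removes outliers.

```python
import numpy as np

# Hypothetical quality scores assigned by a large multimodal model to
# synthetic training samples (scale and values are illustrative only).
scores = np.array([0.91, 0.88, 0.15, 0.93, 0.87, 0.90, 0.20, 0.89])

def filter_outliers(scores, k=1.0):
    """Keep samples whose score lies within k standard deviations of the
    mean; k=1.0 is an assumed threshold, not the paper's setting."""
    mu, sigma = scores.mean(), scores.std()
    return np.abs(scores - mu) <= k * sigma

mask = filter_outliers(scores)  # boolean mask over the sample pool
```

Under this criterion the two clearly low-scoring samples fall outside one standard deviation of the mean and are dropped, while the tight cluster of high-scoring samples is retained.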

📝 Abstract
This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT's representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.
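The abstract's feature fusion idea, combining decoupled intermediate ViT layer outputs, can be sketched as a learnable weighted sum. The shapes, the softmax-normalized scalar weights, and the function names below are assumptions for illustration; the paper does not specify the exact fusion mechanism here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_feats, logits):
    """Fuse intermediate layer outputs with one learnable scalar weight
    per layer (a lightweight sketch, not the paper's exact module).

    layer_feats: (num_layers, num_tokens, dim)
    logits:      (num_layers,) unnormalized layer weights
    """
    w = softmax(logits)                          # normalized layer weights
    return np.tensordot(w, layer_feats, axes=1)  # weighted sum over layers

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16, 32))   # 4 decoupled intermediate layers
fused = fuse_layers(feats, np.zeros(4))    # zero logits -> equal weights
```

With zero logits every layer receives weight 0.25, so the fused output equals the plain mean of the layer features; training the logits lets the model emphasize whichever intermediate layers carry the most useful fine-grained semantics.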
Problem

Research questions and friction points this paper is trying to address.

Enhances multimodal document understanding efficiency
Improves synthetic data quality and feature fusion
Reduces inference latency for document processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced synthetic data quality optimization
Improved visual feature fusion strategy
Optimized inference methodologies for efficiency
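The inference optimization includes quantization-aware inference. As one plausible ingredient, here is a minimal sketch of symmetric per-tensor int8 weight quantization; the actual quantization scheme used by PP-DocBee2 is not specified on this page, so this is an assumption.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (an assumed scheme)."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, stored in a quarter of the memory
```

Storing weights as int8 cuts memory traffic roughly 4x versus fp32 and enables integer kernels, which is one common way such latency reductions are obtained.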
👥 Authors
Kui Huang (Baidu Inc.)
Xinrong Chen (Baidu Inc.)
Wenyu Lv (Baidu Inc.)
Jincheng Liao (ECNU)
Guanzhong Wang (Baidu Inc.)
Yi Liu (Baidu Inc.)