PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges of low data quality and inefficient inference in multimodal understanding of Chinese business documents, this paper proposes a synergistic optimization framework for high-quality synthetic data generation and efficient visual reasoning. First, a statistical filtering mechanism, guided by large-model-based evaluation, automatically removes low-quality synthetic samples. Second, ViT intermediate-layer representations are decoupled, and a lightweight cross-layer feature fusion module is introduced to enhance fine-grained semantic modeling. Third, quantization-aware inference is integrated to reduce computational overhead. Evaluated on a newly constructed Chinese business document benchmark, the proposed method achieves an 11.4% accuracy improvement and a 73.0% reduction in inference latency over state-of-the-art baselines. This work establishes a scalable technical pathway toward high-accuracy, low-latency intelligent parsing of Chinese business documents.
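The summary's first component, statistical filtering of synthetic samples, can be sketched as follows. This is a minimal illustration only: the scoring model, score scale, and deviation threshold `k` are all assumptions, since the paper states only that a statistical criterion over large-model quality scores removes outliers.

```python
import numpy as np

# Hypothetical quality scores assigned by a large multimodal model to
# synthetic training samples (scale and values are illustrative only).
scores = np.array([0.91, 0.88, 0.15, 0.93, 0.87, 0.90, 0.20, 0.89])

def filter_outliers(scores, k=1.0):
    """Keep samples whose score lies within k standard deviations of the
    mean; k=1.0 is an assumed threshold, not the paper's setting."""
    mu, sigma = scores.mean(), scores.std()
    return np.abs(scores - mu) <= k * sigma

mask = filter_outliers(scores)  # boolean mask over the sample pool
```

Under this criterion the two clearly low-scoring samples fall outside one standard deviation of the mean and are dropped, while the tight cluster of high-scoring samples is retained.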

📝 Abstract
This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT's representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.
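The abstract's feature fusion idea, combining decoupled intermediate ViT layer outputs, can be sketched as a learnable weighted sum. The shapes, the softmax-normalized scalar weights, and the function names below are assumptions for illustration; the paper does not specify the exact fusion mechanism here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_feats, logits):
    """Fuse intermediate layer outputs with one learnable scalar weight
    per layer (a lightweight sketch, not the paper's exact module).

    layer_feats: (num_layers, num_tokens, dim)
    logits:      (num_layers,) unnormalized layer weights
    """
    w = softmax(logits)                          # normalized layer weights
    return np.tensordot(w, layer_feats, axes=1)  # weighted sum over layers

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16, 32))   # 4 decoupled intermediate layers
fused = fuse_layers(feats, np.zeros(4))    # zero logits -> equal weights
```

With zero logits every layer receives weight 0.25, so the fused output equals the plain mean of the layer features; training the logits lets the model emphasize whichever intermediate layers carry the most useful fine-grained semantics.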
Problem

Research questions and friction points this paper is trying to address.

Enhances multimodal document understanding efficiency
Improves synthetic data quality and feature fusion
Reduces inference latency for document processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced synthetic data quality optimization
Improved visual feature fusion strategy
Optimized inference methodologies for efficiency
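The inference optimization includes quantization-aware inference. As one plausible ingredient, here is a minimal sketch of symmetric per-tensor int8 weight quantization; the actual quantization scheme used by PP-DocBee2 is not specified on this page, so this is an assumption.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (an assumed scheme)."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, stored in a quarter of the memory
```

Storing weights as int8 cuts memory traffic roughly 4x versus fp32 and enables integer kernels, which is one common way such latency reductions are obtained.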
👥 Authors
Kui Huang (Baidu Inc.)
Xinrong Chen (Baidu Inc.)
Wenyu Lv (Baidu Inc.)
Jincheng Liao (ECNU)
Guanzhong Wang (Baidu Inc.)
Yi Liu (Baidu Inc.)