🤖 AI Summary
To address the challenges of low data quality and inefficient inference in multimodal understanding of Chinese business documents, this paper proposes a synergistic optimization framework for high-quality synthetic data generation and efficient visual reasoning. First, a statistical filtering mechanism, guided by large-model-based evaluation, automatically removes low-quality synthetic samples. Second, ViT intermediate-layer representations are decoupled, and a lightweight cross-layer feature fusion module is introduced to enhance fine-grained semantic modeling. Third, quantization-aware inference is integrated to reduce computational overhead. Evaluated on a newly constructed Chinese business document benchmark, the proposed method achieves an 11.4% accuracy improvement and a 73.0% reduction in inference latency over its predecessor. This work establishes a scalable technical pathway toward high-accuracy, low-latency intelligent parsing of Chinese business documents.
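The summary does not spell out the statistical criterion used to discard low-quality synthetic samples. A minimal sketch of one plausible instantiation, assuming each sample has received a scalar quality score from a large multimodal evaluator and that samples falling more than `k` standard deviations below the mean score are treated as outliers (the function name and threshold are illustrative, not from the paper):

```python
import numpy as np

def filter_synthetic_samples(scores, k=1.0):
    """Keep samples whose evaluator score is no more than k standard
    deviations below the mean; drop low-quality outliers.

    scores: iterable of scalar quality scores, one per synthetic sample.
    Returns a boolean mask over the samples (True = keep).
    """
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    return scores >= mu - k * sigma

# Example: one clearly low-quality sample among otherwise similar scores
scores = [0.82, 0.79, 0.85, 0.10, 0.81]
mask = filter_synthetic_samples(scores)  # drops only the 0.10 sample
```

A one-sided threshold is used here because only *low*-quality samples need removing; unusually high scores are not a defect.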
📝 Abstract
This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT's representational capacity by decoupling its intermediate layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at [https://github.com/PaddlePaddle/PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX).
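The abstract describes fusing underutilized ViT intermediate-layer features but does not give the fusion operator. A minimal sketch of one common lightweight realization, assuming the selected layer outputs are stacked and combined by a learned softmax-normalized gate (the shapes and the `fuse_layers` interface are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_feats, gate_logits):
    """Weighted sum over selected ViT layer outputs.

    layer_feats: array of shape (L, N, D) — L intermediate layers,
                 N patch tokens, D hidden dimension.
    gate_logits: learnable per-layer logits of shape (L,).
    Returns the fused features of shape (N, D).
    """
    w = softmax(gate_logits)                     # normalized layer weights
    return np.tensordot(w, layer_feats, axes=1)  # contract over the L axis

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16, 8))    # 4 layers, 16 tokens, dim 8
fused = fuse_layers(feats, np.zeros(4))    # zero logits -> uniform average
```

With zero logits the gate degenerates to a plain mean over layers; training the logits lets the model up-weight whichever intermediate layers carry the most useful fine-grained semantics.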