Uni-Parser Technical Report

📅 2025-12-17
🤖 AI Summary
Multi-modal document parsing for scientific literature and patents faces inherent trade-offs among accuracy, throughput, and scalability. To address this, we propose an industrial-grade cross-modal parsing engine built on a loosely coupled, modular multi-expert architecture that enables fine-grained, aligned parsing of text, mathematical formulas, tables, figures, and chemical structures. An adaptive GPU load-balancing strategy and a distributed inference framework support both joint multi-modal parsing and on-demand switching between configurable parsing modes. On eight NVIDIA RTX 4090D GPUs, the engine reaches a throughput of up to 20 PDF pages per second, which makes deployment across billions of pages practical. The system supports downstream tasks such as scientific literature retrieval, chemical structure extraction, and AI4Science dataset construction in both academic and industrial settings.
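The summary does not describe how the GPU load balancing works. As a rough illustration of the general idea only, the sketch below uses a plain greedy least-loaded heuristic, which is not taken from the report: each job goes to whichever GPU currently has the smallest queued cost. The function name `assign_jobs` and the per-job cost estimates are assumptions made for this example.

```python
import heapq
from typing import List, Tuple


def assign_jobs(job_costs: List[float], num_gpus: int) -> Tuple[List[int], List[float]]:
    """Greedy least-loaded assignment (illustrative, not Uni-Parser's strategy).

    Returns the GPU index chosen for each job and the final load per GPU.
    """
    # Min-heap of (queued_cost, gpu_index); ties resolve to the lower index.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)

    assignment = []
    for cost in job_costs:
        load, gpu = heapq.heappop(heap)      # least-loaded GPU right now
        assignment.append(gpu)
        heapq.heappush(heap, (load + cost, gpu))

    loads = [0.0] * num_gpus
    for load, gpu in heap:
        loads[gpu] = load
    return assignment, loads


# Hypothetical cost estimates, e.g. one heavy table page followed by light text pages.
assignment, loads = assign_jobs([4, 1, 1, 1, 3], num_gpus=2)
print(assignment, loads)
```

A production scheduler would also have to account for batching, model placement, and differing cost per modality, none of which this sketch attempts.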

📝 Abstract
This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
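To put the "billions of pages" claim in concrete terms, the following back-of-the-envelope calculation converts the reported peak rate into wall-clock time. It assumes the peak of 20 pages per second is sustained and that throughput scales linearly with the number of 8-GPU nodes. Both are simplifying assumptions made here, not results from the report, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_parse(total_pages: int, pages_per_second: float, nodes: int = 1) -> float:
    """Wall-clock days to parse `total_pages`, assuming linear scaling across nodes."""
    if pages_per_second <= 0 or nodes <= 0:
        raise ValueError("pages_per_second and nodes must be positive")
    return total_pages / (pages_per_second * nodes) / SECONDS_PER_DAY


# One billion pages at the reported peak of 20 pages/s per 8-GPU node.
one_node = days_to_parse(1_000_000_000, 20.0)
hundred_nodes = days_to_parse(1_000_000_000, 20.0, nodes=100)

print(f"{one_node:.1f} days on 1 node")          # 578.7 days on 1 node
print(f"{hundred_nodes:.2f} days on 100 nodes")  # 5.79 days on 100 nodes
```

Under these assumptions a single node needs roughly 19 months per billion pages, so corpus-scale parsing depends on running many such nodes in parallel.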
Problem

Research questions and friction points this paper is trying to address.

Parses scientific and patent documents with high throughput and accuracy
Maintains cross-modal alignments across text, equations, tables, figures, and chemical structures
Enables scalable extraction for downstream AI and science applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular multi-expert architecture for cross-modal alignment
Adaptive GPU load balancing with distributed inference
Optimized cloud deployment achieving up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs
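The idea of a loosely coupled multi-expert design with configurable parsing modes can be sketched as a registry of independent per-modality experts behind a single dispatcher. This is a minimal illustration under stated assumptions, not Uni-Parser's implementation: `Region`, `EXPERTS`, `parse`, and the toy expert functions are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Region:
    page: int
    kind: str     # modality label, e.g. "text", "table", "formula"
    payload: str  # stand-in for the cropped page content


Expert = Callable[[Region], str]

# Hypothetical experts. Each one is independent, so a new modality can be
# supported by registering one more entry without touching the others.
EXPERTS: Dict[str, Expert] = {
    "text": lambda r: r.payload.strip(),
    "table": lambda r: f"<table>{r.payload}</table>",
    "formula": lambda r: f"${r.payload}$",
}


def parse(regions: List[Region], modalities: Optional[Set[str]] = None) -> List[dict]:
    """Run experts over detected regions.

    modalities=None parses everything (holistic mode); otherwise only the
    listed modalities are parsed (modality-specific mode).
    """
    results = []
    for order, region in enumerate(regions):
        if modalities is not None and region.kind not in modalities:
            continue
        expert = EXPERTS.get(region.kind)
        if expert is None:
            continue  # no expert registered for this modality
        # Keeping page and reading order lets outputs be aligned across modalities.
        results.append({"page": region.page, "order": order,
                        "kind": region.kind, "content": expert(region)})
    return results


regions = [Region(1, "text", " Intro "), Region(1, "formula", "E=mc^2"), Region(1, "table", "a|b")]
holistic = parse(regions)                           # all three regions
tables_only = parse(regions, modalities={"table"})  # only the table region
```

Because every result carries its page and reading-order index, output from a modality-specific run can still be matched against a later holistic run over the same document.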
Xi Fang
DP Technology
Haoyi Tao
DP Technology
Shuwen Yang
East China Normal University
Visual question answering, visual reasoning
Suyang Zhong
DP Technology
Haocheng Lu
DP Technology
Han Lyu
DP Technology
Chaozheng Huang
DP Technology
Xinyu Li
DP Technology
Linfeng Zhang
DP Technology; AI for Science Institute
AI for Science, multi-scale modeling, molecular simulation, drug/materials design
Guolin Ke
DP Technology
Machine Learning, AI for Science