Uni-Parser Technical Report

📅 2025-12-17

🤖 AI Summary

Multi-modal document parsing for scientific literature and patents faces inherent trade-offs among accuracy, throughput, and scalability. To address this, we propose Uni-Parser, an industrial-grade parsing engine built on a loosely coupled, modular multi-expert architecture that enables fine-grained, aligned parsing of text, mathematical formulas, tables, figures, and chemical structures. Adaptive GPU load balancing and distributed inference support both joint multi-modal parsing and on-demand switching to modality-specific modes. On eight NVIDIA RTX 4090D GPUs, the system reaches a throughput of up to 20 PDF pages per second, which makes deployment across billions of pages cost-efficient. This scale supports downstream tasks such as scientific literature retrieval, chemical structure extraction, and AI4Science dataset construction in both academic and industrial settings.

📝 Abstract

This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
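To make the "billions of pages" claim concrete, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch, not a figure from the report: it assumes the peak rate of 20 pages per second is sustained, and that throughput scales linearly when more 8-GPU nodes are added. The function name `days_to_parse` and the `nodes` parameter are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def days_to_parse(total_pages: int, pages_per_second: float, nodes: int = 1) -> float:
    """Wall-clock days needed to parse `total_pages`.

    Assumes a steady per-node rate and linear scaling across nodes;
    both are simplifying assumptions, not claims from the report.
    """
    if pages_per_second <= 0 or nodes <= 0:
        raise ValueError("pages_per_second and nodes must be positive")
    total_seconds = total_pages / (pages_per_second * nodes)
    return total_seconds / SECONDS_PER_DAY


# One billion pages on a single 8-GPU node at the reported peak rate.
one_node = days_to_parse(1_000_000_000, 20.0)

# The same workload spread over ten such nodes (hypothetical cluster size).
ten_nodes = days_to_parse(1_000_000_000, 20.0, nodes=10)

print(f"1 node:   {one_node:.1f} days")
print(f"10 nodes: {ten_nodes:.1f} days")
```

Under these assumptions a single node needs roughly 579 days per billion pages, so processing at that scale relies on running many nodes in parallel.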

Problem

Research questions and friction points this paper is trying to address.

Parsing scientific and patent documents with both high throughput and high accuracy

Maintaining fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures

Scaling extraction to billions of pages for downstream AI and science applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular multi-expert architecture for cross-modal alignment

Adaptive GPU load balancing with distributed inference

Optimized cloud deployment achieving up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs
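The contributions listed above can be illustrated with a small sketch. It shows one plausible reading of a loosely coupled multi-expert design: each modality has its own independently registered expert, a mode filter selects holistic or modality-specific parsing, and output order follows region order so results stay aligned with the page. Every name here (`Region`, `Parser`, `register`, `pick_gpu`) is hypothetical, and the least-loaded GPU choice is a naive stand-in, not the adaptive strategy the report describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Region:
    """Hypothetical layout element detected on a page."""
    modality: str  # e.g. "text", "table", "formula"
    content: str


Expert = Callable[[Region], str]


@dataclass
class Parser:
    """Registry of per-modality experts; experts know nothing of each other."""
    experts: Dict[str, Expert] = field(default_factory=dict)

    def register(self, modality: str, expert: Expert) -> None:
        self.experts[modality] = expert

    def parse(self, regions: List[Region],
              modes: Optional[Iterable[str]] = None) -> List[str]:
        # modes=None means holistic parsing with every registered expert.
        active = set(self.experts) if modes is None else set(modes)
        results = []
        for region in regions:  # region order is preserved in the output
            if region.modality in active and region.modality in self.experts:
                results.append(self.experts[region.modality](region))
        return results


def pick_gpu(queue_depths: List[int]) -> int:
    """Naive least-loaded choice: index of the GPU with the fewest queued jobs."""
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)


parser = Parser()
parser.register("text", lambda r: r.content.strip())
parser.register("table", lambda r: f"<table>{r.content}</table>")

page = [Region("text", " Intro "), Region("table", "a|b"), Region("formula", "E=mc^2")]
holistic = parser.parse(page)                      # all registered experts
tables_only = parser.parse(page, modes={"table"})  # modality-specific mode
```

Because experts are only looked up by modality name, adding support for a new modality in this sketch is a single `register` call and leaves the existing experts untouched, which is the property the loosely coupled design aims for.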
