🤖 AI Summary
Layout-structured parsing of scanned document images remains hindered by error propagation across multi-stage pipelines and poor generalization to diverse layouts. This paper proposes layoutRL, an end-to-end layout-aware reinforcement learning framework that explicitly models document layout structure. It introduces a composite reward integrating normalized edit distance, paragraph count accuracy, and reading-order fidelity. The authors also construct Infinity-Doc-55K, a large-scale dataset combining synthetically generated and expert-filtered real-world document images. layoutRL is instantiated in a unified vision-language model (VLM) that jointly handles OCR, table and formula extraction, and reading-order inference, optimizing the parsing policy with policy-gradient methods. Extensive experiments demonstrate state-of-the-art performance across English and Chinese OCR, table recognition, formula extraction, and reading-order prediction: the resulting parser achieves significantly higher structural fidelity and accuracy than both specialized pipeline systems and generic VLMs.
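To make the composite reward concrete, here is a minimal sketch of how its three terms could be combined. The paper does not publish its exact formulation here; the weights (`w_edit`, `w_par`, `w_order`), the paragraph-splitting heuristic, and the pairwise reading-order score below are all illustrative assumptions.

```python
# Hypothetical sketch of a layoutRL-style composite parsing reward.
# All weights and helper formulas are illustrative assumptions, not the
# paper's actual implementation.
from difflib import SequenceMatcher


def edit_similarity(pred: str, ref: str) -> float:
    """1 minus a normalized edit distance, approximated via difflib's ratio."""
    if not pred and not ref:
        return 1.0
    return SequenceMatcher(None, pred, ref).ratio()


def paragraph_count_accuracy(pred: str, ref: str) -> float:
    """Penalize mismatched paragraph counts (blank-line-separated blocks)."""
    p = len([s for s in pred.split("\n\n") if s.strip()])
    r = len([s for s in ref.split("\n\n") if s.strip()])
    return 1.0 - abs(p - r) / max(p, r, 1)


def reading_order_score(pred_ids: list, ref_ids: list) -> float:
    """Fraction of adjacent reference block pairs kept in order in the prediction."""
    pos = {block: i for i, block in enumerate(pred_ids)}
    pairs = list(zip(ref_ids, ref_ids[1:]))
    if not pairs:
        return 1.0
    kept = sum(1 for a, b in pairs
               if a in pos and b in pos and pos[a] < pos[b])
    return kept / len(pairs)


def composite_reward(pred: str, ref: str, pred_order: list, ref_order: list,
                     w_edit: float = 0.6, w_par: float = 0.2,
                     w_order: float = 0.2) -> float:
    """Weighted sum of the three reward terms, in [0, 1]."""
    return (w_edit * edit_similarity(pred, ref)
            + w_par * paragraph_count_accuracy(pred, ref)
            + w_order * reading_order_score(pred_order, ref_order))
```

A perfect parse (identical text, matching paragraph count, preserved block order) scores 1.0 under this sketch; each term degrades independently, which is what lets a policy-gradient learner trade off text accuracy against structural fidelity.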
📄 Abstract
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned documents with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.