Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

2025-06-01
Citations: 0
Influential: 0
AI Summary
Parsing scanned document images into structured output remains hindered by error propagation across multi-stage pipelines and poor generalization to diverse layouts. This paper proposes layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware. Its composite reward integrates normalized edit distance, paragraph count accuracy, and reading-order preservation. The authors also release Infinity-Doc-55K, a dataset that combines 55K synthetic scanned-document samples with expert-filtered real-world documents. The framework is instantiated in Infinity-Parser, a parser built on a vision-language model (VLM) that handles OCR, table and formula extraction, and reading-order detection. On English and Chinese benchmarks for these tasks, Infinity-Parser reports state-of-the-art accuracy and structural fidelity, outperforming both specialist pipeline systems and general-purpose VLMs.

Abstract
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
Problem

Research questions and friction points this paper is trying to address.

Automated parsing of scanned documents into structured formats
Error propagation in traditional multi-stage document parsing pipelines
Limited adaptability to diverse document layouts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layout-aware reinforcement learning framework
Vision-language-model-based parser
Composite reward optimization strategy
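
The composite reward named above combines three signals: normalized edit distance, paragraph count accuracy, and reading-order preservation. The following is a minimal sketch of how such a reward could be computed on plain-text output, not the authors' implementation. All function names, the equal default weights, the blank-line paragraph splitting, and the exact-match paragraph alignment are assumptions made for illustration; the paper's actual definitions may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_reward(pred: str, ref: str) -> float:
    """1 minus the normalized edit distance; 1.0 means an exact match."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def split_paragraphs(text: str) -> list[str]:
    """Assumption: paragraphs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def count_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Penalize the relative difference in paragraph counts."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    diff = abs(len(pred_paras) - len(ref_paras))
    return max(0.0, 1.0 - diff / len(ref_paras))


def order_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Fraction of matched paragraph pairs that keep their reference order.

    Assumption: paragraphs are aligned by exact text equality, which is a
    simplification; a real system would likely need fuzzy matching.
    """
    position = {p: i for i, p in enumerate(pred_paras)}
    matched = [position[p] for p in ref_paras if p in position]
    if len(matched) < 2:
        return 1.0 if matched else 0.0
    pairs = [(i, j) for i in range(len(matched)) for j in range(i + 1, len(matched))]
    kept = sum(1 for i, j in pairs if matched[i] < matched[j])
    return kept / len(pairs)


def composite_reward(pred: str, ref: str, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three components (weights are hypothetical)."""
    pred_paras, ref_paras = split_paragraphs(pred), split_paragraphs(ref)
    parts = (
        edit_reward(pred, ref),
        count_reward(pred_paras, ref_paras),
        order_reward(pred_paras, ref_paras),
    )
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)
```

In this sketch, a prediction identical to the reference scores 1.0, while swapping two paragraphs lowers both the edit and the order components even though the paragraph count still matches. Keeping the three components separate makes it easy to inspect which aspect of the layout a given output gets wrong.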
Authors

Baode Wang, INFLY Tech
Biao Wu, Australian Artificial Intelligence Institute
Weizhen Li, INFLY Tech
Meng Fang, University of Liverpool (Natural Language Processing, Reinforcement Learning, Agents, Artificial Intelligence)
Yanjie Liang, INFLY Tech
Zuming Huang, Senior Algorithm Engineer, Ant Group (OCR, Document Intelligence, Large Multimodal Models)
Haozhe Wang, INFLY Tech
Jun Huang, INFLY Tech
Ling Chen, Australian Artificial Intelligence Institute
Wei Chu, INFLY Tech
Yuan Qi, INFLY Tech