Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

📅 2026-03-24
🤖 AI Summary
Existing end-to-end document parsing methods often produce repetitive, hallucinated, and structurally inconsistent output on casually captured real-world documents, primarily because large-scale, high-quality full-page supervision data is scarce and training is not structure-aware. This work proposes a data-training co-design framework: Realistic Scene Synthesis composes layout templates with document elements to build large-scale, structurally diverse full-page supervision data, and a Document-Aware Training Recipe adds progressive learning and structure-token optimization. Integrated into a 1B-parameter multimodal large language model, the approach is reported to achieve superior accuracy and robustness on both scanned/digital and real-world captured documents. The authors state that the models, data synthesis pipelines, and a new benchmark, Wild-OmniDocBench, will be publicly released.

📝 Abstract
Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions, primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
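
The abstract describes composing layout templates with document elements to obtain full-page supervision. As a rough illustration of that general idea only, and not the authors' pipeline, the Python sketch below fills a template's regions with sampled elements and joins them in reading order to form a page-level target. The `Region` class, `TEMPLATE`, `ELEMENT_POOL`, and `compose_page` are all hypothetical names, and the Markdown-style target format is an assumption. Image rendering and capture-style distortions are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # element type expected in this slot, e.g. "title" or "table"
    box: tuple   # (x0, y0, x1, y1) in normalized page coordinates


# Hypothetical layout template: regions listed in reading order.
TEMPLATE = [
    Region("title", (0.1, 0.05, 0.9, 0.12)),
    Region("text", (0.1, 0.15, 0.9, 0.45)),
    Region("table", (0.1, 0.50, 0.9, 0.75)),
    Region("formula", (0.1, 0.80, 0.9, 0.90)),
]

# Hypothetical element pool: each kind maps to candidate markup snippets.
ELEMENT_POOL = {
    "title": ["# Quarterly Report", "# Lecture Notes"],
    "text": ["Revenue grew in the third quarter.", "We define the loss below."],
    "table": ["| a | b |\n|---|---|\n| 1 | 2 |"],
    "formula": ["$$E = mc^2$$"],
}


def compose_page(template, pool, rng):
    """Fill each region with a sampled element of the matching kind.

    Returns the per-region placements (what a renderer would draw) and the
    full-page target string (what a parser would be trained to emit).
    """
    placements = []
    for region in template:
        content = rng.choice(pool[region.kind])
        placements.append(
            {"kind": region.kind, "box": region.box, "content": content}
        )
    # Join contents in template (reading) order to form the page-level label.
    target = "\n\n".join(p["content"] for p in placements)
    return placements, target


placements, target = compose_page(TEMPLATE, ELEMENT_POOL, random.Random(0))
```

Because the placements and the target come from the same sampling pass, the label matches the page content by construction, which is the main appeal of synthetic full-page supervision.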
Problem

Research questions and friction points this paper is trying to address.

document parsing
end-to-end parsing
structured output
data scarcity
structural inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Realistic Scene Synthesis
Document-Aware Training
End-to-End Document Parsing
Structure-Token Optimization
Wild-OmniDocBench