Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of vision-language models under real-world physical conditions, despite their strong performance on digital document benchmarks. We introduce the first physical reconstruction benchmark covering five realistic perturbations: scanning artifacts, page curvature, screen capture distortions, lighting variations, and perspective tilts. By physically re-creating every image in OmniDocBench v1.5 one-to-one and at high fidelity, we establish precise correspondences between digital and physical samples. Through physics-based simulation, controlled ablation studies, and cross-domain alignment mapping, we enable fine-grained, full-scale attribution of performance degradation to specific factors: geometric distortions, optical artifacts, or inherent model limitations. Our benchmark reveals substantial performance gaps for current document understanding models in real-world settings and provides a diagnostic tool and standardized challenge for developing robust document intelligence systems.

📝 Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
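To illustrate why a one-to-one digital/physical mapping matters for factor-wise attribution, here is a minimal sketch of a paired comparison. It is not the paper's evaluation protocol: the function name `scenario_gaps`, the 0-100 score scale, the image IDs, and all scores are hypothetical and chosen only for illustration.

```python
from statistics import mean


def scenario_gaps(digital, physical):
    """Mean per-image score drop (digital minus physical) for each scenario.

    digital:  {image_id: score} measured on the original digital pages
    physical: {scenario: {image_id: score}} measured on the re-captured pages
    """
    gaps = {}
    for scenario, scores in physical.items():
        # Because every physical image has a digital counterpart, the
        # difference can be taken per image rather than per dataset.
        paired_drops = [digital[image_id] - score for image_id, score in scores.items()]
        gaps[scenario] = mean(paired_drops)
    return gaps


# Made-up scores for two pages under two of the five scenarios.
digital = {"page_001": 90, "page_002": 80}
physical = {
    "Scanning": {"page_001": 90, "page_002": 70},
    "Warping": {"page_001": 50, "page_002": 60},
}

gaps = scenario_gaps(digital, physical)
worst = max(gaps, key=gaps.get)  # scenario with the largest mean drop
```

With partial sampling or no digital counterpart, only dataset-level averages could be compared, so a drop could not be separated from a difference in which pages were sampled.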
Problem

Research questions and friction points this paper is trying to address.

document parsing
reality gap
physical reconstruction
robustness
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

physical reconstruction
document parsing
vision-language models
reality gap
benchmark
🔎 Similar Papers
Changda Zhou
PaddlePaddle Team, Baidu Inc.
Ziyue Gao
PaddlePaddle Team, Baidu Inc.; Hong Kong University of Science and Technology (Guangzhou)
Xueqing Wang
PaddlePaddle Team, Baidu Inc.
Tingquan Gao
PaddlePaddle Team, Baidu Inc.
Cheng Cui
BUAA
deep learning, network design, OCR, mllm
Jing Tang
The Hong Kong University of Science and Technology (Guangzhou)
Data Management, Social Networks, Machine Learning, Blockchain
Yi Liu
Baidu Inc.
CV, LLM, VLM