🤖 AI Summary
This work addresses the lack of systematic evaluation of vision-language models under real-world physical conditions, despite their strong performance on digital document benchmarks. We introduce the first physical reconstruction benchmark encompassing five realistic perturbations: scanning artifacts, page curvature, screen capture distortions, lighting variations, and perspective tilts. By achieving one-to-one, high-fidelity physical re-creation of all images in OmniDocBench v1.5, we establish precise correspondences between digital and physical samples. Through physics-based simulation, controlled ablation studies, and cross-domain alignment mapping, we enable fine-grained, full-scale attribution of performance degradation to specific factors: geometric distortions, optical artifacts, or inherent model limitations. Our benchmark reveals substantial performance gaps in current document understanding models under real-world conditions and provides both a diagnostic tool and a standardized challenge for developing robust document intelligence systems.
📝 Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
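The factor-wise attribution the abstract describes can be illustrated with a minimal sketch: because every physical sample has a matched digital original, the degradation attributable to one perturbation is simply the mean score drop over matched pairs. All identifiers, scores, and the metric below are hypothetical placeholders, not the paper's actual data or evaluation protocol.

```python
# Hypothetical sketch of factor-wise degradation attribution via
# one-to-one digital/physical correspondence. Scores, image ids, and
# factor names are illustrative, not from Real5-OmniDocBench.
from statistics import mean

# Per-sample parsing scores on the digital originals (higher is better).
digital_scores = {"img_001": 0.95, "img_002": 0.90, "img_003": 0.88}

# Scores on the physically re-created versions, one dict per scenario.
physical_scores = {
    "Scanning": {"img_001": 0.91, "img_002": 0.85, "img_003": 0.84},
    "Warping":  {"img_001": 0.72, "img_002": 0.69, "img_003": 0.70},
}

def degradation_by_factor(digital, physical):
    """Mean score drop vs. the matched digital original, per perturbation."""
    return {
        factor: mean(digital[i] - scores[i] for i in scores)
        for factor, scores in physical.items()
    }

for factor, drop in degradation_by_factor(digital_scores, physical_scores).items():
    print(f"{factor}: mean degradation {drop:.3f}")
```

The key design point is that the subtraction is per image, not per dataset average, which is only possible because the reconstruction is one-to-one rather than a partial resample.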