🤖 AI Summary
This work addresses the lack of systematic evaluation of vision-language models under real-world physical conditions, despite their strong performance on digital document benchmarks. We introduce the first physical reconstruction benchmark encompassing five realistic perturbations: scanning artifacts, page curvature, screen capture distortions, lighting variations, and perspective tilts. By achieving one-to-one, high-fidelity physical re-creation of all images in OmniDocBench v1.5, we establish precise correspondences between digital and physical samples. Through physics-based simulation, controlled ablation studies, and cross-domain alignment mapping, we enable fine-grained, full-scale attribution of performance degradation to specific factors: geometric distortions, optical artifacts, or inherent model limitations. Our benchmark reveals substantial performance gaps in current document understanding models under real-world conditions and provides both a diagnostic tool and a standardized challenge for developing robust document intelligence systems.
📝 Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
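The factor-wise attribution the abstract describes can be illustrated with a minimal sketch: because every physical sample has a matched digital original, the degradation attributable to one perturbation is simply the mean score drop over matched pairs. All identifiers, scores, and the metric below are hypothetical placeholders, not the paper's actual data or evaluation protocol.

```python
# Hypothetical sketch of factor-wise degradation attribution via
# one-to-one digital/physical correspondence. Scores, image ids, and
# factor names are illustrative, not from Real5-OmniDocBench.
from statistics import mean

# Per-sample parsing scores on the digital originals (higher is better).
digital_scores = {"img_001": 0.95, "img_002": 0.90, "img_003": 0.88}

# Scores on the physically re-created versions, one dict per scenario.
physical_scores = {
    "Scanning": {"img_001": 0.91, "img_002": 0.85, "img_003": 0.84},
    "Warping":  {"img_001": 0.72, "img_002": 0.69, "img_003": 0.70},
}

def degradation_by_factor(digital, physical):
    """Mean score drop vs. the matched digital original, per perturbation."""
    return {
        factor: mean(digital[i] - scores[i] for i in scores)
        for factor, scores in physical.items()
    }

for factor, drop in degradation_by_factor(digital_scores, physical_scores).items():
    print(f"{factor}: mean degradation {drop:.3f}")
```

The key design point is that the subtraction is per image, not per dataset average, which is only possible because the reconstruction is one-to-one rather than a partial resample.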