How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing document parsing benchmarks suffer from annotation errors, contamination risk, and a near-exclusive focus on clean pages, which limits how well they reflect real-world performance. A three-stage audit of OmniDocBench confirms errors in 12.08% of its evaluator-scored blocks. This work proposes PureDocBench, the first document parsing benchmark generated programmatically from HTML/CSS with fully traceable source code, spanning 10 domains and 66 subcategories. It comprises 1,475 pages, each rendered in clean, digitally degraded, and real-degraded versions (4,425 images in total). A systematic evaluation of 40 models finds that the best model scores only about 74 out of 100; specialist parsers with at most 4B parameters rival or surpass general-purpose vision-language models that are 5 to 100 times larger; no model exceeds 67% on formula recognition when the metric is averaged across all three tracks; and rankings shift substantially under degradation, which underscores the need for multi-condition evaluation.

📝 Abstract

The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with <=4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.
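The arithmetic in the abstract can be checked in a few lines, and the same sketch shows how unequal degradation drops can reverse a ranking. The audit counts, page count, and average drops below come from the abstract. The two clean-track scores are hypothetical values chosen only for illustration, and all function and variable names are assumptions rather than part of the released code.

```python
# Figures quoted in the abstract.
AUDITED_BLOCKS = 21_353
CONFIRMED_ERRORS = 2_580
PAGES = 1_475
VERSIONS = ("clean", "digital", "real")

# Average Overall-point drops reported in the abstract, per model family.
REPORTED_DROP = {
    "general_vlm": {"clean": 0.0, "digital": 0.99, "real": 8.52},
    "pipeline_specialist": {"clean": 0.0, "digital": 4.90, "real": 14.21},
}

# HYPOTHETICAL clean-track scores, not taken from the paper.
HYPOTHETICAL_CLEAN = {"general_vlm": 68.0, "pipeline_specialist": 72.0}


def error_rate_percent(errors: int, blocks: int) -> float:
    """Share of audited blocks confirmed as errors, in percent."""
    return round(100.0 * errors / blocks, 2)


def total_images(pages: int, versions: tuple) -> int:
    """Each page is rendered once per version."""
    return pages * len(versions)


def ranking(track: str) -> list:
    """Order the two families by (hypothetical clean score - reported drop)."""
    scores = {
        name: round(HYPOTHETICAL_CLEAN[name] - REPORTED_DROP[name][track], 2)
        for name in HYPOTHETICAL_CLEAN
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(error_rate_percent(CONFIRMED_ERRORS, AUDITED_BLOCKS))
    print(total_images(PAGES, VERSIONS))
    for track in VERSIONS:
        print(track, ranking(track))
```

With these illustrative clean scores, the pipeline specialist leads on the clean and digital tracks but falls behind the general VLM on the real-degraded track, which is the kind of reversal the abstract warns about.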

Problem

Research questions and friction points this paper is trying to address.

document parsing

benchmark

annotation quality

data contamination

degraded documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

document parsing

benchmark

source-traceable