🤖 AI Summary
Existing document parsing benchmarks suffer from annotation errors, data contamination, and an exclusive focus on clean pages, limiting their ability to reflect real-world performance. This work proposes PureDocBench, the first document parsing benchmark generated programmatically from HTML/CSS with fully traceable source code, spanning 10 domains and 66 subcategories. It comprises 1,475 pages, each rendered under clean, digitally degraded, and real-degraded conditions (4,425 images in total), with data quality ensured through a rigorous three-stage auditing process. A systematic evaluation of 40 models reveals that the current best model achieves only around 74/100; specialized small models (≤4B parameters) match or surpass general-purpose vision-language models that are 5–100 times larger; no model exceeds 67% on formula recognition when averaged across all three tracks; and performance rankings shift substantially under degradation, underscoring the necessity of multi-condition evaluation.
📝 Abstract
The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with <=4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.
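
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the quoted numbers, not part of the released code; all variable names are illustrative assumptions, and the "extra drop" is a derived quantity that the abstract does not itself report.

```python
# Figures quoted in the abstract; variable names are illustrative only.
pages = 1475
versions = 3  # clean, digitally degraded, real-degraded
total_images = pages * versions

scored_blocks = 21353      # OmniDocBench evaluator-scored blocks screened
confirmed_errors = 2580    # errors confirmed by the audit
error_rate_pct = round(100 * confirmed_errors / scored_blocks, 2)

# Overall points lost under (digital, real) degradation, as quoted.
general_vlm_drop = (0.99, 8.52)
pipeline_drop = (4.90, 14.21)

# Derived here, not reported in the abstract: how many more points
# pipeline specialists lose than general VLMs in each degraded track.
extra_drop = tuple(
    round(p - g, 2) for p, g in zip(pipeline_drop, general_vlm_drop)
)

print(total_images, error_rate_pct, extra_drop)
```

The computed image count and error rate agree with the 4,425 images and 12.08% stated above, and the derived gap works out to 3.91 and 5.69 Overall points for the digital and real tracks respectively.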