DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing document parsing evaluation relies heavily on standardized benchmarks, suffers from dataset bias, and obscures fine-grained errors through holistic scoring. Method: We propose DOCR-Inspector, the first fine-grained automated evaluation framework for document parsing. It adopts the VLM-as-a-Judge paradigm, leveraging vision-language models to detect 28 error types by jointly analyzing PDF page images and structured outputs; employs a Chain-of-Checklist reasoning mechanism for hierarchical quality analysis; and establishes DOCRcaseBench, a realistic-scenario benchmark, together with DOCRcase-200K, a large-scale training dataset. Contribution/Results: We release the lightweight open-source model DOCR-Inspector-7B, which significantly outperforms state-of-the-art models (e.g., Gemini 2.5 Pro) on 882 real-world cases. Its evaluations precisely localize error patterns, enabling targeted optimization of parsing systems.

📝 Abstract
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
Problem

Research questions and friction points this paper is trying to address.

Assess document parsing quality reliably in real-world scenarios.
Detect fine-grained errors beyond overall benchmark scores.
Address dataset biases causing inconsistent model performance rankings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VLM-as-a-Judge for fine-grained error detection
Introduces Chain-of-Checklist reasoning for hierarchical assessment
Constructs the large-scale training dataset DOCRcase-200K and the manually annotated benchmark DOCRcaseBench
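The VLM-as-a-Judge idea above can be sketched as a small evaluation loop: a checklist-driven prompt is assembled for the judge model, and its reply is parsed into typed error records. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, the three example error types (out of the paper's 28), and the checklist wording are all hypothetical.

```python
# Hypothetical sketch of a VLM-as-a-Judge evaluation step (illustrative
# names only; not taken from the DOCR-Inspector codebase).
import json

# A small illustrative subset of the 28 predefined error types.
ERROR_TYPES = ["missing_text", "table_structure", "formula_garbled"]

# Chain-of-Checklist style questions the judge answers in order,
# moving from coarse coverage down to fine-grained element quality.
CHECKLIST = [
    "Does the parsed output cover all text visible on the page?",
    "Are table rows and columns reconstructed correctly?",
    "Are formulas transcribed without corruption?",
]

def build_prompt(parsed_output: str) -> str:
    """Assemble the judge prompt: answer each checklist item, then
    emit detected errors as a JSON list of {type, span} objects."""
    items = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(CHECKLIST))
    return (
        "You are a document-parsing judge. Compare the page image with "
        "the parsed output below, answer each checklist item in order, "
        f"then list all errors as JSON objects with a 'type' field (one "
        f"of {ERROR_TYPES}) and a 'span' field.\n\n"
        f"Checklist:\n{items}\n\nParsed output:\n{parsed_output}"
    )

def parse_judgment(raw: str) -> list[dict]:
    """Validate the judge's JSON reply, keeping only known error types."""
    errors = json.loads(raw)
    return [e for e in errors if e.get("type") in ERROR_TYPES]

# Example with a canned judge reply (a real system would send the prompt
# and the page image to a VLM here):
reply = '[{"type": "table_structure", "span": "Table 2"},' \
        ' {"type": "other", "span": ""}]'
print(parse_judgment(reply))  # the unknown "other" type is filtered out
```

The point of the checklist ordering is that the judge's reasoning mirrors the hierarchical structure of parsing quality: page-level coverage first, then per-element (table, formula) checks, so each detected error can be assigned to a specific type rather than folded into one holistic score.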