🤖 AI Summary
This work addresses the absence of an end-to-end benchmark and semantics-aware assessment framework for structured information extraction from PDFs under enterprise-grade, complex JSON schemas. We introduce ExtractBench, the first open-source benchmark of its kind, comprising 35 high-value economic-domain PDF documents, human-annotated JSON schemas, and 12,867 evaluable fields. It features a novel fine-grained evaluation framework that treats JSON schemas as executable specifications, enabling field-level differentiated scoring (exact match, tolerance-based, and semantic equivalence) and explicitly distinguishing omissions from hallucinations. Experiments on leading large language models (e.g., GPT-5/5.2, Gemini-3, Claude 4.5) reveal significant performance degradation in broad-schema scenarios, with effective output rates dropping to 0% on a 369-field financial statement schema, underscoring current models' severe unreliability on complex structured extraction tasks.
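The schema-as-executable-specification idea can be sketched as below. The field names, the `x-eval` annotation key, and the scorer functions are illustrative assumptions for this sketch, not ExtractBench's actual format or API.

```python
import math

# Hypothetical per-field metric declarations, embedded in a JSON-Schema-style
# spec via a custom "x-eval" key (an assumption, not the benchmark's real format).
SCHEMA = {
    "ticker":  {"type": "string", "x-eval": "exact"},                  # identifier
    "revenue": {"type": "number", "x-eval": "tolerance", "tol": 0.01}, # quantity
    "company": {"type": "string", "x-eval": "semantic"},               # name
}

def score_exact(pred, gold):
    return 1.0 if pred == gold else 0.0

def score_tolerance(pred, gold, tol):
    # Relative tolerance for numeric quantities.
    return 1.0 if math.isclose(pred, gold, rel_tol=tol) else 0.0

def score_semantic(pred, gold):
    # Placeholder: a real semantic-equivalence check might use embeddings or
    # an LLM judge; here we only normalize casing, punctuation, and whitespace.
    norm = lambda s: " ".join(s.lower().replace(".", "").split())
    return 1.0 if norm(pred) == norm(gold) else 0.0

def score_field(name, pred, gold):
    # The schema itself dispatches each field to its declared metric.
    spec = SCHEMA[name]
    if spec["x-eval"] == "exact":
        return score_exact(pred, gold)
    if spec["x-eval"] == "tolerance":
        return score_tolerance(pred, gold, spec["tol"])
    return score_semantic(pred, gold)

print(score_field("revenue", 1003.0, 1000.0))          # within 1% -> 1.0
print(score_field("company", "Acme Corp.", "acme corp"))  # normalized match -> 1.0
```

Under this reading, adding a new field type means declaring its metric in the schema rather than editing the evaluator.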
📝 Abstract
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
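The omission-versus-hallucination distinction can be made concrete with a minimal field-level classification, sketched below; the outcome labels, helper names, and example records are assumptions for illustration, not the paper's actual taxonomy.

```python
# Hypothetical field-level outcome classification: a missing prediction for a
# gold-populated field counts as an omission, while a populated prediction for
# a gold-empty field counts as a hallucination (labels are illustrative).
def classify_field(pred, gold):
    if gold is not None and pred is None:
        return "omission"
    if gold is None and pred is not None:
        return "hallucination"
    if gold is None and pred is None:
        return "correct_empty"
    return "match" if pred == gold else "mismatch"

def tally(pred_record, gold_record):
    # Aggregate outcomes over a flat record; nested JSON would be
    # flattened to dotted field paths first.
    counts = {}
    for key in gold_record.keys() | pred_record.keys():
        outcome = classify_field(pred_record.get(key), gold_record.get(key))
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts

gold = {"ticker": "AAPL", "revenue": 394.3, "footnote": None}
pred = {"ticker": "AAPL", "revenue": None, "footnote": "n/a"}
print(tally(pred, gold))  # one match, one omission, one hallucination
```

Separating these outcomes matters because a model that omits uncertain fields and one that fabricates them get very different scores, even at the same raw accuracy.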