ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of an end-to-end evaluation benchmark and semantics-aware assessment framework for structured information extraction from PDFs under enterprise-grade, complex JSON schemas. We introduce ExtractBench, the first open-source benchmark of its kind, comprising 35 high-value economic-domain PDF documents, human-annotated JSON schemas, and 12,867 evaluable fields. It features a novel fine-grained evaluation framework that treats JSON schemas as executable specifications, enabling field-level differentiated scoring (exact match, tolerance-based, and semantic equivalence) and explicitly distinguishing omissions from hallucinations. Experiments on leading large language models (e.g., GPT-5/5.2, Gemini-3, Claude 4.5) reveal significant performance degradation in broad-schema scenarios, with valid output rates dropping to 0% on a 369-field financial statement schema, underscoring the severe unreliability of current models on complex structured extraction tasks.

📝 Abstract
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
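The abstract's idea of treating the schema as an executable specification, where each field declares its own scoring metric and omissions are scored differently from hallucinations, can be sketched roughly as follows. This is a minimal illustration, not ExtractBench's actual format: the `metric` annotation, field names, and string-normalization "semantic" check are all assumptions (a real semantic-equivalence metric might use embeddings or an LLM judge).

```python
# Sketch of "schema as executable specification": each field declares the
# metric used to score it. Annotation names and fields are hypothetical.
import math

SCHEMA = {
    "ticker":  {"type": "string", "metric": "exact"},
    "revenue": {"type": "number", "metric": "tolerance", "rel_tol": 0.01},
    "company": {"type": "string", "metric": "semantic"},
}

def normalize(s: str) -> str:
    """Crude stand-in for semantic equivalence: lowercase, alphanumerics only."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def score_field(spec, gold, pred):
    """Return (score, status), distinguishing omission (prediction missing)
    from hallucination (prediction present where the gold label is absent)."""
    if pred is None and gold is None:
        return 1.0, "correct"
    if pred is None:
        return 0.0, "omission"
    if gold is None:
        return 0.0, "hallucination"
    metric = spec["metric"]
    if metric == "exact":
        return (1.0 if pred == gold else 0.0), "scored"
    if metric == "tolerance":
        ok = math.isclose(pred, gold, rel_tol=spec.get("rel_tol", 0.0))
        return (1.0 if ok else 0.0), "scored"
    if metric == "semantic":
        return (1.0 if normalize(pred) == normalize(gold) else 0.0), "scored"
    raise ValueError(f"unknown metric: {metric}")

def score_record(schema, gold, pred):
    """Score every field declared in the schema."""
    return {name: score_field(spec, gold.get(name), pred.get(name))
            for name, spec in schema.items()}
```

For example, a prediction that gets the ticker right, is within 1% on revenue, but drops the company name would score `{"ticker": (1.0, "scored"), "revenue": (1.0, "scored"), "company": (0.0, "omission")}`, making the omission visible rather than folding it into a generic error count.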
Problem

Research questions and friction points this paper is trying to address.

structured extraction
PDF-to-JSON
evaluation benchmark
nested schema
LLM reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured extraction
evaluation framework
JSON Schema
nested data
LLM benchmarking
Authors
Nick Ferguson (Contextual AI, Mountain View, CA, USA)
Josh Pennington (Contextual AI, Mountain View, CA, USA)
Narek Beghian (Contextual AI, Mountain View, CA, USA)
Aravind Mohan (Assistant Professor, McMurry University)
Douwe Kiela (Contextual AI; Stanford University)
Sheshansh Agrawal (Microsoft Research)
Thien Hang Nguyen (Contextual AI, Mountain View, CA, USA)