READoc: A Unified Benchmark for Realistic Document Structured Extraction

📅 2024-09-08
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Document structure extraction (DSE) lacks a unified, real-world-oriented evaluation benchmark, hindering progress due to fragmented and narrow existing evaluations. Method: We introduce READoc—the first end-to-end DSE benchmark—comprising 2,233 authentic PDFs from arXiv and GitHub, centered on the task of “PDF → semantically rich Markdown.” We propose the DSE Evaluation S³uite, featuring Standardization, Segmentation, and Scoring, grounded in three core principles: end-to-end processing, semantic completeness, and formatting fidelity. High-quality annotations are derived from real PDF parsing to support multi-granularity structural recognition, cross-modal alignment, and interpretable scoring. Contribution/Results: Comprehensive evaluation of state-of-the-art methods reveals systematic deficiencies in hierarchical heading recovery, formula/table preservation, and citation consistency. READoc and the S³uite are open-sourced, advancing DSE toward practical deployment and standardized assessment.
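The paper does not spell out the Segmentation and Scoring stages here, so as a rough illustration only, the following is a hedged sketch of what per-category scoring of predicted versus reference Markdown could look like. The segment categories, the line-based segmentation rules, and the use of an edit-distance ratio are my assumptions for this sketch, not the S³uite's actual design.

```python
import difflib

def segment_markdown(md: str) -> dict:
    """Split a Markdown document into coarse segment categories.

    Hypothetical rules: ATX headings, display formulas ($$...$$),
    table rows, and remaining plain-text lines.
    """
    segments = {"heading": [], "formula": [], "table": [], "text": []}
    for line in md.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            segments["heading"].append(line)
        elif line.startswith("$$") or line.endswith("$$"):
            segments["formula"].append(line)
        elif line.startswith("|"):
            segments["table"].append(line)
        else:
            segments["text"].append(line)
    return segments

def score(pred_md: str, ref_md: str) -> dict:
    """Per-category similarity via a normalized edit-distance ratio
    (1.0 = identical); categories empty in both documents are skipped."""
    pred, ref = segment_markdown(pred_md), segment_markdown(ref_md)
    scores = {}
    for cat in ref:
        a, b = "\n".join(pred[cat]), "\n".join(ref[cat])
        if not a and not b:
            continue
        scores[cat] = difflib.SequenceMatcher(None, a, b).ratio()
    return scores
```

Scoring per category rather than over the whole document makes failures interpretable: a system can score well on plain text while still losing headings or tables, which matches the kind of deficiency breakdown the summary describes.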

📝 Abstract
Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field's advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 2,233 diverse and real-world documents from arXiv and GitHub. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation, and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
Problem

Research questions and friction points this paper is trying to address.

Lack of unified evaluation for Document Structured Extraction systems
Fragmented benchmarks hinder DSE field advancement
Need realistic PDF-to-Markdown conversion standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces READoc benchmark for document extraction
Uses diverse real-world PDFs for dataset
Develops the S³uite for unified evaluation
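The Standardization module's role is to normalize superficial Markdown variation so that systems are not penalized for stylistic differences. A minimal sketch, assuming hypothetical normalization rules (Setext-to-ATX heading conversion, blank-line collapsing, trailing-whitespace stripping); the actual S³uite rules are not specified here:

```python
import re

def standardize(md: str) -> str:
    """Normalize superficial Markdown variation (hypothetical rules):
    convert Setext headings (underlined with === or ---) to ATX (#),
    collapse runs of blank lines, and strip trailing whitespace."""
    lines = md.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i].rstrip()
        nxt = lines[i + 1].rstrip() if i + 1 < len(lines) else ""
        if line and re.fullmatch(r"=+", nxt):
            out.append("# " + line)   # Setext level 1 -> ATX h1
            i += 2
            continue
        if line and re.fullmatch(r"-+", nxt):
            out.append("## " + line)  # Setext level 2 -> ATX h2
            i += 2
            continue
        out.append(line)
        i += 1
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(out))
    return text.strip() + "\n"
```

Normalizing both the prediction and the reference before comparison means two systems emitting equivalent Markdown in different dialects receive the same score, which is a prerequisite for the fair cross-system comparison the benchmark aims for.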
Zichao Li
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Aizier Abulaiti
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction; Large Language Models
Xuanang Chen
Institute of Software, Chinese Academy of Sciences
Information Retrieval; Natural Language Processing
Jia Zheng
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Le Sun
Institute of Software, CAS
Information Retrieval; Natural Language Processing