Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing PDF table extraction evaluation methods, which rely on structural metrics and fail to capture semantic equivalence. The authors propose the first benchmark framework leveraging synthetically generated PDFs paired with LaTeX-derived ground truth, introducing an LLM-as-a-judge approach for semantic-level table evaluation. Their methodology integrates a matching pipeline to reconcile parser output discrepancies and combines synthetic data generation, LLM-based semantic scoring, Tree Edit Distance-based Similarity (TEDS), Grid Table Similarity (GriTS), and human validation. Evaluated on over 1,500 human-annotated samples, the LLM-based assessment achieves a Pearson correlation of 0.93 with human judgments, substantially outperforming TEDS (0.68) and GriTS (0.70). Benchmarking 21 parsers on 451 tables reveals significant performance variations, establishing a reproducible and scalable paradigm for semantic table evaluation.
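The LLM-as-a-judge approach described above can be sketched as a prompt-and-parse loop: feed the judge the ground-truth table and the parser's output, then extract a numeric quality score from its reply. The prompt wording, score scale, and `SCORE:` output format below are hypothetical illustrations, not the paper's actual rubric (which lives in the linked repositories):

```python
import re


def build_judge_prompt(ground_truth_md: str, extracted_md: str) -> str:
    """Assemble a judging prompt that asks an LLM to rate how well an
    extracted table preserves the semantics of its ground truth.
    The exact wording here is a made-up stand-in for the paper's prompt."""
    return (
        "You are evaluating PDF table extraction quality.\n"
        "Rate how well the EXTRACTED table preserves the semantic content of\n"
        "the GROUND TRUTH table on a 0-100 scale. Ignore purely stylistic\n"
        "differences such as whitespace or cell alignment.\n\n"
        f"GROUND TRUTH:\n{ground_truth_md}\n\n"
        f"EXTRACTED:\n{extracted_md}\n\n"
        "Answer with a single line: SCORE: <number>"
    )


def parse_score(llm_response: str) -> float:
    """Pull the numeric score out of the judge's reply; raise if absent."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", llm_response)
    if match is None:
        raise ValueError("no score found in LLM response")
    return float(match.group(1))
```

In a real pipeline, `build_judge_prompt` would be sent to an LLM API and `parse_score` applied to the response; pairing each ground-truth table with the matching parser output is handled by the matching pipeline the summary mentions.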

📝 Abstract
Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task.
Code and data: https://github.com/phorn1/pdf-parse-bench
Metric study and human evaluation: https://github.com/phorn1/table-metric-study
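The human validation study above boils down to correlating each metric's scores with human quality judgments over the same table pairs and comparing the resulting Pearson coefficients. A minimal, dependency-free sketch of that comparison (the sample score values are invented for illustration, not taken from the paper):

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: hypothetical metric scores and human judgments per table pair.
human_judgments = [0.9, 0.2, 0.7, 0.4, 1.0]
metric_scores = [0.85, 0.3, 0.65, 0.5, 0.95]
r = pearson_r(metric_scores, human_judgments)
```

The paper runs this comparison for the LLM judge, TEDS, and GriTS against 1,500+ human judgments; the metric whose scores track humans most closely (here, the LLM judge at r=0.93) wins.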
Problem

Research questions and friction points this paper is trying to address.

PDF table extraction
semantic evaluation
benchmarking
LLM-as-a-judge
scientific data mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge
semantic table evaluation
PDF table extraction
synthetic PDF benchmark
human-aligned metric
Pius Horn
Institute for Machine Learning and Analytics (IMLA), Offenburg University, Offenburg, Germany
Janis Keuper
Institute for Machine Learning and Analytics (IMLA), Offenburg University, Germany
Pattern Recognition, Computer Vision, Geophysics