Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing PDF table extraction evaluation methods, which rely on structural metrics and fail to capture semantic equivalence. The authors propose the first benchmark framework leveraging synthetically generated PDFs paired with LaTeX-derived ground truth, introducing an LLM-as-a-judge approach for semantic-level table evaluation. Their methodology integrates a matching pipeline to reconcile parser output discrepancies and combines synthetic data generation, LLM-based semantic scoring, Tree Edit Distance-based Similarity (TEDS), Grid Table Similarity (GriTS), and human validation. Evaluated on over 1,500 human-annotated samples, the LLM-based assessment achieves a Pearson correlation of 0.93 with human judgments, substantially outperforming TEDS (0.68) and GriTS (0.70). Benchmarking 21 parsers on 451 tables reveals significant performance variations, establishing a reproducible and scalable paradigm for semantic table evaluation.
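The LLM-as-a-judge approach described above can be sketched as a prompt-and-parse loop: feed the judge the ground-truth table and the parser's output, then extract a numeric quality score from its reply. The prompt wording, score scale, and `SCORE:` output format below are hypothetical illustrations, not the paper's actual rubric (which lives in the linked repositories):

```python
import re


def build_judge_prompt(ground_truth_md: str, extracted_md: str) -> str:
    """Assemble a judging prompt that asks an LLM to rate how well an
    extracted table preserves the semantics of its ground truth.
    The exact wording here is a made-up stand-in for the paper's prompt."""
    return (
        "You are evaluating PDF table extraction quality.\n"
        "Rate how well the EXTRACTED table preserves the semantic content of\n"
        "the GROUND TRUTH table on a 0-100 scale. Ignore purely stylistic\n"
        "differences such as whitespace or cell alignment.\n\n"
        f"GROUND TRUTH:\n{ground_truth_md}\n\n"
        f"EXTRACTED:\n{extracted_md}\n\n"
        "Answer with a single line: SCORE: <number>"
    )


def parse_score(llm_response: str) -> float:
    """Pull the numeric score out of the judge's reply; raise if absent."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", llm_response)
    if match is None:
        raise ValueError("no score found in LLM response")
    return float(match.group(1))
```

In a real pipeline, `build_judge_prompt` would be sent to an LLM API and `parse_score` applied to the response; pairing each ground-truth table with the matching parser output is handled by the matching pipeline the summary mentions.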

📝 Abstract
Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task.
Code and data: https://github.com/phorn1/pdf-parse-bench
Metric study and human evaluation: https://github.com/phorn1/table-metric-study
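The human validation study above boils down to correlating each metric's scores with human quality judgments over the same table pairs and comparing the resulting Pearson coefficients. A minimal, dependency-free sketch of that comparison (the sample score values are invented for illustration, not taken from the paper):

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: hypothetical metric scores and human judgments per table pair.
human_judgments = [0.9, 0.2, 0.7, 0.4, 1.0]
metric_scores = [0.85, 0.3, 0.65, 0.5, 0.95]
r = pearson_r(metric_scores, human_judgments)
```

The paper runs this comparison for the LLM judge, TEDS, and GriTS against 1,500+ human judgments; the metric whose scores track humans most closely (here, the LLM judge at r=0.93) wins.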
Problem

Research questions and friction points this paper is trying to address.

PDF table extraction
semantic evaluation
benchmarking
LLM-as-a-judge
scientific data mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge
semantic table evaluation
PDF table extraction
synthetic PDF benchmark
human-aligned metric
Pius Horn
Institute for Machine Learning and Analytics (IMLA), Offenburg University, Offenburg, Germany
Janis Keuper
Institute for Machine Learning and Analytics (IMLA), Offenburg University, Germany
Pattern Recognition, Computer Vision, Geophysics