🤖 AI Summary
Existing PDF parsing benchmarks largely neglect mathematical formulas or lack semantically aware evaluation, hindering scientific document understanding and large language model training.
Method: We propose the first benchmark framework specifically for mathematical formula extraction from PDFs, built upon synthetically generated PDFs with precise LaTeX ground truth, covering diverse formula structures and layouts. We introduce an LLM-as-a-judge paradigm for semantic evaluation, featuring a robust two-stage matching pipeline (structural alignment followed by semantic equivalence verification) and validate its high agreement with human judgments (Pearson's *r* = 0.78) via large-scale manual annotation.
Contribution/Results: We systematically evaluate over 20 state-of-the-art PDF parsers on 100 synthetic documents containing 2,000+ formulas, revealing substantial performance disparities. We publicly release the benchmark dataset, evaluation code, and protocols to advance math-aware document parsing research.
📝 Abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is the first use of LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson's r = 0.78) than CDM (r = 0.34) and text similarity (r ≈ 0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
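To make the two-stage pipeline concrete, here is a minimal sketch of the idea: stage one aligns each ground-truth formula with the most structurally similar parser output, and stage two asks a judge whether aligned pairs are semantically equivalent. All function names, the similarity threshold, and the greedy alignment strategy are illustrative assumptions, not the paper's actual implementation; the toy string-comparison judge stands in for a real LLM call.

```python
from difflib import SequenceMatcher

def normalize(latex: str) -> str:
    # Crude structural normalization: drop whitespace and math delimiters.
    return latex.replace(" ", "").strip("$")

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def align(predicted, ground_truth, threshold=0.5):
    """Stage 1 (hypothetical simplification): greedily pair each
    ground-truth formula with its most similar parser output."""
    pairs = []
    for gt in ground_truth:
        best = max(predicted, key=lambda p: similarity(gt, p), default=None)
        if best is not None and similarity(gt, best) >= threshold:
            pairs.append((gt, best))
    return pairs

def evaluate(predicted, ground_truth, judge):
    """Stage 2: a judge (in the paper, an LLM) decides semantic
    equivalence for each aligned pair; score over all ground truth."""
    pairs = align(predicted, ground_truth)
    return sum(judge(gt, p) for gt, p in pairs) / len(ground_truth)

# Toy judge standing in for the LLM: exact match after normalization.
toy_judge = lambda a, b: normalize(a) == normalize(b)

score = evaluate(
    predicted=["\\frac{1}{2}", "x ^ 2"],
    ground_truth=["x^2", "\\frac{1}{2}"],
    judge=toy_judge,
)
# Both formulas align and are judged equivalent, so score == 1.0.
```

In the real pipeline the judge would prompt an LLM with both LaTeX strings and parse its verdict; decoupling alignment from judging is what lets the framework tolerate parsers that reorder, merge, or drop formulas.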