SCORE: A Semantic Evaluation Framework for Generative Document Parsing

📅 2025-09-16
🤖 AI Summary
Traditional evaluation metrics (e.g., CER, WER, IoU, TEDS) rely on strict structural alignment, leading to erroneous penalization of semantically correct but syntactically diverse outputs in generative document parsing. To address this, we propose SCORE, the first semantic evaluation framework tailored for generative document parsing. Our method introduces an *interpretation-agnostic* paradigm: (i) semantic-aware edit distance, (ii) token-level error decomposition, (iii) spatially tolerant table matching, and (iv) hierarchical consistency verification, all operating on a format-agnostic, normalized representation. Evaluated on 1,114 real-world document pages, SCORE corrects the false penalties (12-25% on average) that traditional metrics impose on pages with ambiguous table layouts, restoring equivalence between semantically valid parses. It achieves a table-level F1 of up to 0.93 and enables end-to-end assessment without requiring external detection modules.
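The "semantic-aware edit distance" over a "format-agnostic, normalized representation" described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the `normalize` rules and the similarity formula are assumptions chosen to show the idea that formatting-only differences should not count as errors.

```python
# Sketch (assumed, not the paper's code): normalize generative output into a
# format-agnostic form, then score content fidelity with an edit distance.
import re

def normalize(text: str) -> str:
    """Map a generative output to a format-agnostic representation."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML-style markup
    text = re.sub(r"[|*_#`]", " ", text)   # drop Markdown table/emphasis syntax
    text = re.sub(r"\s+", " ", text)       # collapse whitespace
    return text.strip().lower()

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def adjusted_edit_similarity(pred: str, ref: str) -> float:
    """1.0 means identical after normalization; formatting differences vanish."""
    p, r = normalize(pred), normalize(ref)
    if not p and not r:
        return 1.0
    return 1.0 - levenshtein(p, r) / max(len(p), len(r))
```

Under this scheme, a Markdown table cell `| **42** |` and the plain string `42` score as equivalent, which is the behavior the framework's interpretation-agnostic paradigm calls for.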

📝 Abstract
Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
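The abstract's "table evaluation with spatial tolerance and semantic alignment" can be illustrated with a minimal sketch. Everything here (the cell representation, the `tol` parameter, the greedy one-to-one matching) is a hypothetical simplification to show the principle that alternative but valid row/column segmentations should not be penalized; it is not the paper's exact algorithm.

```python
# Hypothetical sketch of spatially tolerant table matching: a predicted cell
# counts as a true positive if a reference cell with the same (normalized)
# text lies within +/- tol rows and columns.
from typing import List, Tuple

Cell = Tuple[int, int, str]  # (row, col, text)

def table_f1(pred: List[Cell], ref: List[Cell], tol: int = 1) -> float:
    unmatched = list(ref)
    tp = 0
    for pr, pc, pt in pred:
        for i, (rr, rc, rt) in enumerate(unmatched):
            if (pt.strip().lower() == rt.strip().lower()
                    and abs(pr - rr) <= tol and abs(pc - rc) <= tol):
                tp += 1
                unmatched.pop(i)   # enforce one-to-one matching
                break
    if not pred or not ref or tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With `tol=1`, a parse that splits a merged header into the adjacent row still matches the reference perfectly, whereas a strict TEDS-style comparison would penalize it.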
Problem

Research questions and friction points this paper is trying to address.

Evaluating generative document parsing systems with traditional metrics
Distinguishing semantic correctness from structural diversity in outputs
Providing fair assessment when valid interpretations differ structurally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adjusted edit distance for content fidelity
Token-level diagnostics for hallucination detection
Hierarchy-aware consistency checks with spatial tolerance
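The token-level diagnostics listed above (distinguishing hallucinations from omissions) can be sketched as a multiset comparison. This is a minimal illustration assuming a simple whitespace tokenizer, not the framework's actual tokenization or error taxonomy.

```python
# Minimal sketch of token-level error decomposition: tokens present only in
# the prediction count as hallucinations, tokens present only in the
# reference count as omissions.
from collections import Counter

def decompose_errors(pred: str, ref: str) -> dict:
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    return {
        "hallucinated": sum((p - r).values()),  # extra tokens in prediction
        "omitted": sum((r - p).values()),       # reference tokens not produced
        "matched": sum((p & r).values()),
    }
```

Separating these two error modes matters because a parser that invents content and one that drops content can have identical edit distances while failing in very different ways.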
Renyu Li
Sr. Algorithm Engineer, Alibaba Group
NLU · NLG · Mental Health
Antonio Jimeno Yepes
Unstructured Technologies
Artificial Intelligence · Document Understanding · Natural Language Processing
Yao You
Unstructured Technologies
Kamil Pluciński
Unstructured Technologies
Maximilian Operlejn
Unstructured Technologies
Crag Wolfe
Unstructured Technologies