LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing document layout analysis methods, which struggle to detect structural errors—such as region merging, splitting, or omission—and are evaluated using conventional geometry-based overlap metrics that fail to capture such logical inconsistencies. To bridge this gap, the authors introduce the LED benchmark, which systematically defines eight canonical layout error types, devises interpretable error-injection algorithms, and constructs a dataset enabling both document-level and element-level error detection and classification. This benchmark establishes a fine-grained, multimodal-compatible framework for evaluating layout robustness beyond traditional assessment paradigms. Experimental results reveal significant structural reasoning deficiencies in state-of-the-art multimodal models on LED, thereby demonstrating the benchmark’s effectiveness and necessity.
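The summary mentions interpretable error-injection algorithms for the eight canonical error types. As a minimal sketch of what injecting two of those types (Merge and Split) into a layout annotation might look like, assuming elements are axis-aligned boxes `(x0, y0, x1, y1, label)` (the paper's actual injection rules and representations may differ):

```python
# Hypothetical sketch of layout error injection; function names and the
# element representation are illustrative, not the paper's actual API.
# A layout element is a tuple (x0, y0, x1, y1, label).

def inject_merge(elements):
    """'Merge' error: fuse the first two elements into one bounding box."""
    if len(elements) < 2:
        return elements
    a, b = elements[0], elements[1]
    merged = (min(a[0], b[0]), min(a[1], b[1]),
              max(a[2], b[2]), max(a[3], b[3]), a[4])
    return [merged] + elements[2:]

def inject_split(elements, idx=0):
    """'Split' error: cut one element horizontally into two halves."""
    x0, y0, x1, y1, label = elements[idx]
    mid = (y0 + y1) / 2
    top = (x0, y0, x1, mid, label)
    bottom = (x0, mid, x1, y1, label)
    return elements[:idx] + [top, bottom] + elements[idx + 1:]
```

Analogous transforms would cover the remaining types (e.g., deleting an element for Missing, adding one for Hallucination, copying one for Duplicate).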

📝 Abstract
Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
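The abstract's core critique is that overlap-based metrics fail to capture logical inconsistencies. A minimal illustration, with an assumed plain IoU implementation: a prediction carrying a clear structural defect (here a Size Error) can still clear a typical mAP matching threshold of IoU 0.5 and be counted as correct.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = (0, 0, 100, 50)         # ground-truth region
oversized = (0, 0, 100, 70)  # 'Size Error': box 40% too tall
score = iou(gt, oversized)   # ≈ 0.71, above the common 0.5 match threshold
```

Geometrically the prediction looks acceptable, which is exactly the gap LED's error-type classification tasks are designed to expose.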
Problem

Research questions and friction points this paper is trying to address.

Document Layout Analysis
Layout Error Detection
Structural Errors
Evaluation Benchmark
Logical Inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layout Error Detection
Document Layout Analysis
Structural Reasoning
Error Injection
Multimodal Benchmark
Inbum Heo
Department of Computer Engineering, Chungnam National Univ., Daejeon, South Korea
Taewook Hwang
Department of Computer Engineering, Chungnam National Univ., Daejeon, South Korea
Jeesu Jung
Chungnam National University
Natural Language Processing
Sangkeun Jung
Chungnam National University
Artificial Intelligence, Natural Language Processing, Machine Learning