Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high annotation cost of document table detection, this paper proposes an automated LaTeX-based pipeline that synthesizes realistic two-column document images with visually diverse table layouts and aligned pixel-level ground-truth table masks. The generated corpus augments the real-world Marmot benchmark and supports a systematic resolution study of the TableNet segmentation model. Trained on the synthetic data, TableNet achieves pixel-wise XOR errors of 4.04% (256×256) and 4.33% (1024×1024) on the synthetic test set; its best result on the Marmot benchmark is 9.18% (at 256×256). Automating data generation also reduces manual annotation effort.

📝 Abstract
Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. The best performance on the Marmot benchmark is 9.18% (at 256x256), and automating data generation reduces manual annotation effort.
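The abstract reports results as a pixel-wise XOR error. A minimal sketch of one plausible reading of that metric follows: the fraction of pixels where the binarized prediction and the ground-truth mask disagree. The function name `xor_error`, the 0.5 threshold, and normalizing by the total pixel count are assumptions for illustration; the paper may normalize differently.

```python
import numpy as np


def xor_error(pred_mask, true_mask, threshold=0.5):
    """Fraction of pixels where two binary table masks disagree.

    Assumption: the error is normalized by the total number of pixels.
    Inputs may be probabilities or 0/1 masks; both are binarized at
    `threshold` (a hypothetical default, not taken from the paper).
    """
    pred = np.asarray(pred_mask) >= threshold
    true = np.asarray(true_mask) >= threshold
    if pred.shape != true.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    # logical_xor is True exactly where the two masks differ
    return float(np.logical_xor(pred, true).mean())


# Toy 4x4 example: the prediction covers one extra column of the table.
true = np.zeros((4, 4))
true[:2, :2] = 1          # ground-truth table: 4 pixels
pred = np.zeros((4, 4))
pred[:2, :3] = 1          # predicted table: 6 pixels
print(xor_error(pred, true))  # 2 of 16 pixels differ -> 0.125
```

Under this reading, a reported value of 4.04% would mean that about 4 in every 100 pixels are labeled differently from the ground truth.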
Problem

Research questions and friction points this paper is trying to address.

Automating table detection in document images
Reducing manual annotation effort for table extraction
Evaluating TableNet's performance with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LaTeX-based synthetic data pipeline
Generates diverse table layouts with ground-truth
Reduces manual annotation with synthetic augmentation
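To make the idea of a LaTeX-based synthetic pipeline concrete, here is a small sketch that emits the source of a two-column page containing one randomly sized table. It is not the authors' pipeline: the function names, the row and column ranges, and the placeholder body text are all hypothetical, and the steps of rendering the page to an image and deriving the aligned table mask are left out.

```python
import random


def make_table_latex(rows, cols, rng):
    """Return a bordered tabular environment filled with random integers."""
    col_spec = "|".join(["c"] * cols)
    lines = [r"\begin{tabular}{|" + col_spec + "|}", r"\hline"]
    for _ in range(rows):
        cells = [str(rng.randint(0, 999)) for _ in range(cols)]
        lines.append(" & ".join(cells) + r" \\ \hline")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


def make_page_latex(seed):
    """Return LaTeX source for a two-column page with one random table.

    The seed makes the layout reproducible, so the same page can be
    regenerated later (for example, when producing its mask).
    """
    rng = random.Random(seed)
    rows = rng.randint(2, 6)   # hypothetical range of table heights
    cols = rng.randint(2, 4)   # hypothetical range of table widths
    filler = "Placeholder body text. " * 40
    return "\n".join([
        r"\documentclass[twocolumn]{article}",
        r"\begin{document}",
        filler,
        r"\begin{table}[h]",
        r"\centering",
        make_table_latex(rows, cols, rng),
        r"\end{table}",
        filler,
        r"\end{document}",
    ])


if __name__ == "__main__":
    print(make_page_latex(seed=0))
```

Because the table's position and size are chosen by the generator, a pipeline of this kind can in principle produce the ground-truth mask automatically instead of relying on manual annotation.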
Krishna Sahukara
Deggendorf Institute of Technology
Zineddine Bettouche
Deggendorf Institute of Technology
machine learning, natural language processing, transformer models, image processing
Andreas Fischer
Deggendorf Institute of Technology