Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images

📅 2025-06-17

🤖 AI Summary

To address the high annotation cost and poor generalization in document table detection, this paper proposes an automated, controllable LaTeX-driven synthesis pipeline that generates realistic two-column document images with adjustable layout, styling, and resolution, together with pixel-level table masks aligned to each page. The pipeline combines geometric layout randomization with high-fidelity rendering, and the resulting corpus is used to train and evaluate the TableNet segmentation model across input resolutions. On the synthetic test set, TableNet reaches a pixel-wise XOR error of 4.04% at 256×256 and 4.33% at 1024×1024; on the real-world Marmot benchmark, the best result is 9.18% (at 256×256). The work pairs controllable synthesis with pixel-level evaluation, reducing manual annotation effort and supporting a systematic resolution study for document understanding.

📝 Abstract

Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set at an input resolution of 256×256, and 4.33% at 1024×1024. On the Marmot benchmark, the best XOR error is 9.18% (at 256×256). Because the pipeline is automated, it also reduces manual annotation effort.
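
The abstract reports results as a pixel-wise XOR error but does not spell out the formula. A minimal sketch, assuming the metric is the fraction of pixels on which the predicted and ground-truth binary table masks disagree; the function name `xor_error` and the nested-list mask representation are illustrative choices that keep the example dependency-free, not the paper's code:

```python
def xor_error(pred_mask, true_mask):
    """Fraction of pixels where two binary masks disagree (assumed definition)."""
    if len(pred_mask) != len(true_mask) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred_mask, true_mask)
    ):
        raise ValueError("masks must have the same shape")
    total = sum(len(row) for row in true_mask)
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    # A pixel counts as an error when exactly one of the two masks marks it as table.
    differing = sum(
        bool(p) != bool(t)
        for p_row, t_row in zip(pred_mask, true_mask)
        for p, t in zip(p_row, t_row)
    )
    return differing / total


# Toy 4x4 page: the true table covers a 2x2 block, the prediction is one column too wide.
true_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
error = xor_error(pred_mask, true_mask)  # 2 of 16 pixels differ
```

Under this reading, an XOR error of 4.04% would mean that roughly 4 in every 100 pixels are labeled differently by the model and the ground truth.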

Problem

Research questions and friction points this paper is trying to address.

Automating table detection in document images

Reducing manual annotation effort for table extraction

Evaluating TableNet's performance with synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LaTeX-based synthetic data pipeline

Generates diverse table layouts with aligned ground-truth masks

Reduces manual annotation effort through synthetic augmentation
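
The page does not describe how the pipeline is implemented. As a rough illustration of the general idea only, the sketch below emits LaTeX source for a two-column page containing one randomly sized table. Every name (`make_table`, `make_page`), the size ranges, and the filler text are hypothetical, and the step that renders the page and derives the table mask is not shown:

```python
import random


def make_table(rng):
    """Return LaTeX source for a tabular with random size and border style."""
    n_rows = rng.randint(2, 6)   # hypothetical size ranges
    n_cols = rng.randint(2, 5)
    ruled = rng.random() < 0.5   # randomly draw a fully ruled or an open table
    col_spec = "|".join("c" * n_cols)
    if ruled:
        col_spec = "|" + col_spec + "|"
    lines = [r"\begin{tabular}{" + col_spec + "}"]
    if ruled:
        lines.append(r"\hline")
    for _ in range(n_rows):
        cells = [str(rng.randint(0, 999)) for _ in range(n_cols)]
        lines.append(" & ".join(cells) + r" \\")
        if ruled:
            lines.append(r"\hline")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


def make_page(seed):
    """Return LaTeX source for a two-column page with one table between filler text."""
    rng = random.Random(seed)  # seeding makes each generated page reproducible
    filler = "Lorem ipsum dolor sit amet. " * rng.randint(20, 60)
    return "\n".join([
        r"\documentclass[twocolumn]{article}",
        r"\begin{document}",
        filler,
        r"\begin{table}[h]",
        r"\centering",
        make_table(rng),
        r"\end{table}",
        filler,
        r"\end{document}",
    ])


source = make_page(seed=0)
```

Seeding the generator per page means any sample can be regenerated from its seed alone, which is one plausible way to keep a synthetic corpus reproducible.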
