TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing table recognition methods, which often decouple structural and content modeling and, due to their reliance on large annotated datasets, suffer from data scarcity in low-resource settings. To overcome these challenges, the authors propose a "perceive-then-fuse" framework that jointly models table structure and content through detail-aware perception learning. The approach integrates a structure-guided cell localization module and a cell-level visual alignment mechanism, enabling efficient end-to-end recognition. Formulated within a language modeling paradigm, the method performs multi-task learning to implicitly fuse fine-grained table details and directly generates HTML output without requiring dataset-specific fine-tuning. Evaluated on seven benchmark datasets, the model achieves state-of-the-art or highly competitive performance, significantly outperforming current end-to-end approaches while improving robustness and interpretability in low-resource scenarios.

πŸ“ Abstract
Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment, enhancing the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
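To make the "perceive-then-fuse" idea concrete, here is a minimal toy sketch of the two-stage flow the abstract describes: first perceive table structure (grid layout) and content (per-cell text) as separate sub-tasks, then fuse both into a single HTML output. All function names (`perceive_structure`, `perceive_content`, `fuse_to_html`) and the cell representation are hypothetical stand-ins for illustration, not the paper's actual model or API; the real method performs these steps implicitly inside a vision-language model.

```python
# Toy illustration of a "perceive-then-fuse" pipeline for table recognition.
# Cells are represented as (row, col, text) triples; in the actual paper these
# would come from learned perception tasks, not be given explicitly.

def perceive_structure(cells):
    """Structure perception sub-task: infer the grid size (rows x cols)."""
    rows = max(r for r, _, _ in cells) + 1
    cols = max(c for _, c, _ in cells) + 1
    return rows, cols

def perceive_content(cells):
    """Content perception sub-task: map each grid position to its text."""
    return {(r, c): text for r, c, text in cells}

def fuse_to_html(cells):
    """Fusion step: combine perceived structure and content into HTML."""
    rows, cols = perceive_structure(cells)
    content = perceive_content(cells)
    body = ""
    for r in range(rows):
        tds = "".join(f"<td>{content.get((r, c), '')}</td>" for c in range(cols))
        body += f"<tr>{tds}</tr>"
    return f"<table>{body}</table>"

cells = [(0, 0, "Name"), (0, 1, "Score"), (1, 0, "A"), (1, 1, "0.93")]
print(fuse_to_html(cells))
```

The point of the sketch is the separation of concerns: structure and content are perceived independently and only fused at output time, which is what lets the paper's multi-task formulation train each perception skill on heterogeneous document data before generating the final HTML.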
Problem

Research questions and friction points this paper is trying to address.

table recognition
modular pipeline
end-to-end learning
data-constrained scenario
structure-content integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

table recognition
end-to-end learning
cell-level alignment
structure-aware modeling
language modeling paradigm