TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VTU datasets predominantly rely on synthetic tables, which lack the visual diversity and structural complexity of real-world scenarios, and ship with fixed, non-paraphrasable instructions. To address these limitations, we propose TABLET—the first large-scale, real-world visual table understanding dataset—comprising 4 million samples, 2 million unique tables, and 20 diverse tasks; 88% of tables retain their original visual formatting. TABLET provides image–HTML pairs, rich metadata, and provenance information, enabling systematic, traceable acquisition and construction of real-world tables for the first time. It supports multi-task instruction rewriting and cross-task generalization evaluation. We conduct end-to-end pixel-to-HTML training on vision-language models (e.g., Qwen2.5-VL-7B), significantly improving robustness on both seen and unseen tasks, as well as practical deployment capability.

📝 Abstract
While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack real-world table complexity and visual diversity
Current VTU datasets provide fixed examples without underlying serialized data
There is no unified large-scale dataset for robust visual table understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with paired image-HTML representations
Preserves original visualizations and source provenance
Fine-tunes vision-language models for robust table understanding
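The abstract's key design point is that each example pairs a table image with its serialized HTML and provenance metadata, which is what makes instructions paraphrasable rather than fixed. A minimal sketch of what such an example record and instruction rewriting might look like is below; all field names, the `source_dataset` value, and the `rewrite_instruction` helper are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a TABLET-style example, inferred from the abstract:
# a table image paired with its HTML source, a task label, an instruction,
# and provenance metadata. Field names are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class TableExample:
    image_path: str                       # rendered table (pixel-only input)
    html: str                             # serialized table, enables reformulation
    task: str                             # one of the 20 VTU tasks
    instruction: str                      # natural-language prompt
    source_dataset: str                   # provenance link to the source dataset
    original_visualization: bool = True   # ~88% keep original formatting


def rewrite_instruction(example: TableExample, template: str) -> TableExample:
    """Reformulate the instruction from a template; possible only because the
    underlying serialized data (HTML) is kept alongside the image."""
    return TableExample(
        image_path=example.image_path,
        html=example.html,
        task=example.task,
        instruction=template.format(task=example.task),
        source_dataset=example.source_dataset,
        original_visualization=example.original_visualization,
    )


ex = TableExample(
    image_path="tables/0001.png",
    html="<table><tr><th>Year</th><th>Sales</th></tr>"
         "<tr><td>2024</td><td>10</td></tr></table>",
    task="question answering",
    instruction="Answer the question using the table image.",
    source_dataset="WikiTableQuestions",  # hypothetical provenance value
)
rewritten = rewrite_instruction(ex, "Perform {task} on the table in the image.")
print(rewritten.instruction)
```

In a fixed-instruction dataset only `instruction` would exist; keeping `html` and `task` alongside the image is what allows generating fresh paraphrases per task at training time.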
Iñigo Alonso
Research Associate, University of Edinburgh
Natural Language Processing, Computational Linguistics, Machine Learning, Natural Language
Imanol Miranda
HiTZ Center – Ixa, University of the Basque Country UPV/EHU
Eneko Agirre
HiTZ Center – Ixa, University of the Basque Country UPV/EHU
Mirella Lapata
School of Informatics, University of Edinburgh
natural language processing