Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

📅 2025-09-09

🤖 AI Summary
Existing benchmarks for visual reasoning over table images are limited in scale, diversity, or reasoning depth. To address this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset for table image understanding, comprising 2.5K richly structured LaTeX-rendered tables and 6K reasoning-intensive question-answer pairs. The generation pipeline is modular, scalable, and fully autonomous: multiple reasoning LLMs collaborate in distinct roles (generation, validation, and inspiration), stronger models seed layouts and topics that weaker models elaborate, and an LLM jury filters the results, all at a total cost under USD 100. Models fine-tuned on Visual-TableQA generalize robustly to external benchmarks and outperform several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly released.

📝 Abstract
Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking visual reasoning over table images
Addressing limited scale and diversity in datasets
Enhancing multimodal models' tabular data interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular, scalable, fully autonomous generation pipeline
Multiple reasoning LLMs collaborating in distinct roles: generation, validation, and inspiration
Cross-model prompting ("inspiration") combined with LLM-jury filtering
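
The jury-filtering idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not state the voting rule, so a strict majority is used here, and the names `QAPair`, `jury_accepts`, `filter_pairs`, and the three stub jurors are hypothetical stand-ins for real LLM judges rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QAPair:
    """A candidate question-answer pair (hypothetical container)."""
    question: str
    answer: str


# A juror is any callable that returns True (accept) or False (reject).
# In a real pipeline each juror would wrap a call to a different LLM.
Juror = Callable[[QAPair], bool]


def jury_accepts(pair: QAPair, jurors: List[Juror]) -> bool:
    """Accept a pair when a strict majority of jurors vote to accept.

    The majority rule is an assumption; the source only says that an
    LLM jury filters the generated data.
    """
    votes = sum(1 for juror in jurors if juror(pair))
    return votes * 2 > len(jurors)


def filter_pairs(pairs: List[QAPair], jurors: List[Juror]) -> List[QAPair]:
    """Keep only the pairs the jury accepts, preserving input order."""
    return [pair for pair in pairs if jury_accepts(pair, jurors)]


# Stub jurors: simple heuristics standing in for LLM judgments.
def has_answer(pair: QAPair) -> bool:
    return bool(pair.answer.strip())


def is_question(pair: QAPair) -> bool:
    return pair.question.strip().endswith("?")


def long_enough(pair: QAPair) -> bool:
    return len(pair.question.split()) >= 5


JURORS: List[Juror] = [has_answer, is_question, long_enough]

if __name__ == "__main__":
    candidates = [
        QAPair("Which row has the highest total revenue?", "Row 3"),
        QAPair("Revenue", ""),
        QAPair("Total?", "42"),
    ]
    for kept in filter_pairs(candidates, JURORS):
        print(kept.question)
```

Using an odd number of jurors avoids ties under a majority rule; a stricter pipeline could instead require unanimity by comparing `votes` to `len(jurors)`.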