TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps: current foundation models reason poorly over tables that are both visually diverse and structurally complex, and no benchmark systematically evaluates such scenarios. To bridge these gaps, the authors introduce TableVista, the first multidimensional evaluation framework for table reasoning that combines variations in visual styling with structural intricacy. Using multi-style rendering, structural transformations, and perturbation-based generation, TableVista builds a benchmark of 3,000 questions expanded into 30,000 multimodal samples. An evaluation of 29 state-of-the-art models shows that current models are robust across diverse visual renderings but degrade markedly on complex layouts and vision-only inputs, exposing critical limitations in table understanding.

📝 Abstract

We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.
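
The sample count in the abstract follows from pairing each of the 3,000 problems with each of its 10 visual variants. The sketch below illustrates that expansion and one plausible way to compare accuracy across variants. It is not the authors' pipeline: the function names, the placeholder variant labels (`variant_0` through `variant_9`), and the result format are assumptions made for illustration only, since the abstract does not enumerate the variants or describe the scoring code.

```python
from collections import defaultdict

NUM_PROBLEMS = 3000          # from the abstract
VARIANTS_PER_PROBLEM = 10    # from the abstract


def expand(problem_ids, variant_names):
    """Pair every problem with every visual variant (hypothetical helper)."""
    return [(pid, variant) for pid in problem_ids for variant in variant_names]


def accuracy_by_variant(results):
    """Compute accuracy per variant.

    `results` is an iterable of (problem_id, variant, is_correct) tuples;
    this record layout is an assumption, not the benchmark's actual format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _problem_id, variant, is_correct in results:
        totals[variant] += 1
        hits[variant] += int(is_correct)
    return {variant: hits[variant] / totals[variant] for variant in totals}


# Placeholder labels; the real variant names are not given in the abstract.
variant_names = [f"variant_{i}" for i in range(VARIANTS_PER_PROBLEM)]
samples = expand(range(NUM_PROBLEMS), variant_names)
print(len(samples))  # 30000
```

Grouping scores by variant in this way is what would let a per-style comparison surface the pattern the abstract reports: stable accuracy across rendering styles, with drops on complex layouts and vision-only settings.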

Problem

Research questions and friction points this paper is trying to address.

multimodal table reasoning

visual complexity

structural complexity

foundation models

table understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal table reasoning

visual complexity

structural complexity