TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the perceptual bottlenecks that impede multimodal large language models from accurately attending to spatial structures in complex hierarchical tables, thereby hindering high-level reasoning. To this end, we propose TableVision—a large-scale, trajectory-aware hierarchical evaluation benchmark that explicitly couples multi-step logical reasoning with pixel-level spatial ground truth through a rendering-driven deterministic localization mechanism. TableVision enables the first quantitative characterization of the “perceptual overload” phenomenon. We further introduce an explicit spatial constraint mechanism and a two-stage decoupled reasoning framework, achieving a 12.3% absolute improvement in overall accuracy on a high-fidelity test set comprising 6,799 reasoning trajectories, substantially unlocking the models’ reasoning potential.
📝 Abstract
Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
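The abstract's "rendering-based deterministic grounding pipeline" rests on a simple idea: when the benchmark itself renders each table, the layout engine already knows every cell's geometry, so pixel-perfect bounding boxes can be attached to reasoning steps with no OCR or heuristic detection. A minimal sketch of that idea, with all names (`Cell`, `ground_cells`) hypothetical rather than taken from the paper:

```python
# Hypothetical sketch of rendering-based deterministic grounding: the
# renderer lays out the table, so each cell's pixel bbox is exact.
from dataclasses import dataclass

@dataclass
class Cell:
    row: int       # top-left grid position
    col: int
    row_span: int  # spans > 1 model hierarchical headers
    col_span: int
    text: str

def ground_cells(cells, col_widths, row_heights):
    """Map each cell to its exact pixel bbox (x0, y0, x1, y1)."""
    # Cumulative edges of the rendered grid, in pixels.
    x_edges = [0]
    for w in col_widths:
        x_edges.append(x_edges[-1] + w)
    y_edges = [0]
    for h in row_heights:
        y_edges.append(y_edges[-1] + h)
    return {
        c.text: (x_edges[c.col], y_edges[c.row],
                 x_edges[c.col + c.col_span], y_edges[c.row + c.row_span])
        for c in cells
    }

# A tiny hierarchical header: "Revenue" spans two sub-columns.
cells = [
    Cell(0, 1, 1, 2, "Revenue"),   # spanning parent header
    Cell(1, 1, 1, 1, "2023"),
    Cell(1, 2, 1, 1, "2024"),
]
boxes = ground_cells(cells, col_widths=[120, 80, 80], row_heights=[30, 30])
print(boxes["Revenue"])  # (120, 0, 280, 30)
```

Because the boxes come from the layout itself rather than from a detector, the spatial ground truth is exact by construction, which is what lets the benchmark couple each step of a multi-step trajectory to specific pixel regions.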
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Hierarchical Tables
Spatially Grounded Reasoning
Perception Bottleneck
Document Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception Bottleneck
Spatially Grounded Reasoning
Hierarchical Tables
Multimodal Large Language Models
Deterministic Grounding
👥 Authors
Xiaoyu Chen
The Hong Kong University of Science and Technology (Guangzhou)
Lu Dai
Hong Kong University of Science and Technology
Hanqing Wang
HUST ➡ Shanghai AI lab ➡ HKUST(gz)
MLLM · Embodied AI · World Model · VLA
Zhuoyu Li
The Chinese University of Hong Kong
Wenbin Dai
Shanghai Jiao Tong University
Industrial Edge Computing · Industrial Informatics · Automation Code Generation · Industrial Control Software
Yanzong Zheng
The Hong Kong University of Science and Technology (Guangzhou)
Zhenggang Xia
The Hong Kong University of Science and Technology (Guangzhou)
Junyong Lin
The Hong Kong University of Science and Technology (Guangzhou)
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics · atomic molecular physics · free electron laser