The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Model performance on tabular data is underexplored, leaving it unclear which model to rely on and which prompt configuration to adopt. To address this, we introduce ToRR, a benchmark that treats robustness as a core evaluation dimension alongside accuracy. ToRR comprises ten cross-domain datasets and supports multiple table representation formats, including Markdown, HTML, and CSV, combined with diverse prompting strategies under a standardized evaluation protocol. Key findings: (1) even strong large language models exhibit pervasive brittleness on table reasoning tasks; (2) no single table format is consistently best, so evaluating across multiple formats is crucial for reliably estimating model capabilities; and (3) testing multiple prompts markedly improves result stability, with a reliability boost that can be equivalent to adding more test examples. By emphasizing consistency and robustness alongside accuracy, ToRR moves table understanding evaluation from simple performance ranking toward a more rigorous and reliable paradigm.

📝 Abstract
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, that measures model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
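The abstract's claim that the reliability boost from testing multiple prompts can be equivalent to adding more test examples can be illustrated with a small simulation. This is a minimal, hypothetical sketch (the accuracy values, run counts, and helper names are invented, not taken from the paper): averaging one model's score over several prompt variants shrinks the run-to-run spread of the estimate, much as enlarging the test set would.

```python
import random
import statistics

random.seed(0)

def eval_once(p_correct, n_examples):
    # Simulate one evaluation run: fraction of n_examples answered correctly,
    # where each example is correct with probability p_correct.
    return sum(random.random() < p_correct for _ in range(n_examples)) / n_examples

# Hypothetical accuracies of the same model under three prompt phrasings.
prompt_accuracies = [0.62, 0.55, 0.70]

# 200 repeated runs with a single prompt vs. the mean over all three prompts.
single_prompt_runs = [eval_once(prompt_accuracies[0], 100) for _ in range(200)]
multi_prompt_runs = [
    statistics.mean(eval_once(p, 100) for p in prompt_accuracies)
    for _ in range(200)
]

# Averaging over prompts reduces the variance of the score estimate.
print(statistics.stdev(single_prompt_runs), statistics.stdev(multi_prompt_runs))
```

The multi-prompt estimate has a noticeably smaller standard deviation across runs, which is the sense in which extra prompts substitute for extra test examples.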
Problem

Research questions and friction points this paper is trying to address.

Evaluates model performance on tabular data
Assesses robustness across table representation formats
Identifies challenges in table understanding and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToRR benchmark for table tasks
Measures model robustness
Tests multiple table formats
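The multi-format testing idea above can be sketched in a few lines: serialize the same table into the common representations the summary names (Markdown, HTML, CSV) and evaluate the model on each. The table contents and function names below are illustrative assumptions, not ToRR's actual implementation.

```python
import csv
import io

# A toy table; in ToRR-style evaluation the same table would be fed to the
# model in each serialization and the scores compared for robustness.
header = ["city", "population"]
rows = [["Paris", "2.1M"], ["Lyon", "0.5M"]]

def to_csv(header, rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

def to_html(header, rows):
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows
    )
    return f"<table>{head}{body}</table>"

serializations = {
    "csv": to_csv(header, rows),
    "markdown": to_markdown(header, rows),
    "html": to_html(header, rows),
}
```

Each value in `serializations` carries identical content, so any score gap between formats reflects the model's sensitivity to representation rather than the task itself.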
🔎 Similar Papers
2024-06-25 · Conference on Empirical Methods in Natural Language Processing · Citations: 2
2024-03-04 · Conference on Empirical Methods in Natural Language Processing · Citations: 4