Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing table-based question answering (QA) research is hindered by high annotation costs, insufficient coverage of complex reasoning scenarios, and structural heterogeneity of tables, which impedes systematic analysis of large language models’ (LLMs) reasoning failures. Method: We propose AutoT2T, an automated text-to-table generation framework that enables controllable synthesis of multi-variant, noisy, hierarchically structured tables from math word problems. We further introduce TabularGSM—the first benchmark explicitly designed to evaluate reasoning traps and structural complexity in table QA. Our approach integrates rule-based problem parsing, pattern-driven table generation, noise injection, and hierarchical complexity modeling. Contribution/Results: We identify a critical failure mechanism in LLMs: performance collapse due to tight coupling between reasoning and table retrieval. Experiments show that state-of-the-art LLMs exhibit severe deficits in collaborative reasoning on TabularGSM, validating both the benchmark’s diagnostic efficacy and the mechanistic insights uncovered.
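The pipeline stages named above (rule-based problem parsing, pattern-driven table generation, noise injection) can be illustrated with a toy sketch. This is a hypothetical illustration, not the paper's actual implementation: the function names, the templated-problem regex, and the distractor-row scheme are all assumptions made for clarity.

```python
import random
import re

def parse_quantities(problem: str) -> dict:
    """Very simplified 'rule-based parsing': extract (count, item) pairs
    from a templated problem like 'Ann has 3 apples and 5 pears.'
    (Hypothetical stand-in for the paper's parsing stage.)"""
    pairs = re.findall(r"(\d+)\s+(\w+)", problem)
    return {item: int(count) for count, item in pairs}

def to_table(quantities: dict, noise: bool = False, seed: int = 0) -> list:
    """Lay the parsed quantities out as table rows; optionally inject an
    irrelevant distractor row, mimicking the noisy table variants the
    benchmark uses for robustness evaluation."""
    rows = [["item", "count"]] + [[k, v] for k, v in quantities.items()]
    if noise:
        rng = random.Random(seed)
        rows.append(["bananas", rng.randint(1, 9)])  # distractor, not in the question
    return rows

table = to_table(parse_quantities("Ann has 3 apples and 5 pears."), noise=True)
```

A model answering "how many apples and pears does Ann have?" must now retrieve the two relevant rows while ignoring the distractor, which is exactly the coupled retrieval-plus-reasoning behavior the benchmark probes.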

📝 Abstract
Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) reliance on costly, manually annotated real-world data, which makes it difficult to cover complex reasoning scenarios; 2) the heterogeneity of table structures, which hinders systematic analysis of the intrinsic mechanisms behind LLMs' underperformance, especially on reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline, AutoT2T, that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark, TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs on complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively on complex Table QA tasks.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive evaluation for reasoning-intensive Table QA tasks
Dependence on costly manual data annotation for complex reasoning
Heterogeneous table structures hinder analysis of LLMs' reasoning failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for table-based reasoning tasks
Generates noisy table variants for robustness
Benchmark spans table complexities and traps
🔎 Similar Papers
2024-06-25 · Conference on Empirical Methods in Natural Language Processing · Citations: 2
Authors:
- Shi-Yu Tian (Nanjing University)
- Zhi Zhou (National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University)
- Wei Dong (School of Artificial Intelligence, Nanjing University)
- Ming Yang (National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University)
- Kun-Yang Yu (LAMDA Group, Nanjing University)
- Zi-Jian Cheng (National Key Laboratory for Novel Software Technology, Nanjing University; School of Intelligence Science and Technology, Nanjing University)
- Lan-Zhe Guo (LAMDA Group, Nanjing University)
- Yu-Feng Li (Nanjing University)