TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

📅 2024-08-17
🏛️ arXiv.org
📈 Citations: 12
Influential: 3
🤖 AI Summary
Existing large language models (LLMs) exhibit significantly degraded performance on real-world industrial tabular question answering (TQA), revealing critical deficiencies in complex table reasoning. Method: We introduce TableBench, an industrial-grade TQA benchmark featuring high complexity and multi-domain coverage—spanning 18 specialized domains and four core capability dimensions—and the first systematic characterization of table reasoning complexity in industrial settings. We further design a multidimensional, fine-grained evaluation framework, and release TableInstruct—a high-quality instruction-tuning dataset—and TableLLM, a lightweight domain-specific model. Contribution/Results: Experiments show GPT-4 achieves only 62.3% of human-level performance on TableBench; all open-source LLMs underperform humans by over 35 percentage points on average. TableLLM matches GPT-3.5’s performance, empirically validating the efficacy of dedicated modeling for tabular reasoning.
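The headline result is reported as a percentage of human-level performance (e.g., GPT-4 at 62.3%). As a minimal sketch of how such a relative score is computed — with illustrative placeholder scores, not the paper's actual data:

```python
def relative_to_human(model_score: float, human_score: float) -> float:
    """Express a model's benchmark score as a percentage of human performance."""
    return 100.0 * model_score / human_score

# Illustrative values only: a model scoring 50 where humans score 80
# reaches 62.5% of human-level performance on that metric.
print(round(relative_to_human(50.0, 80.0), 1))  # → 62.5
```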

📝 Abstract
Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark, TableBench, covering 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving performance comparable to GPT-3.5. Extensive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, with the most advanced model, GPT-4, achieving only a modest score compared to humans.
Problem

Research questions and friction points this paper is trying to address.

Addressing the gap between academic benchmarks and industrial table question answering needs.
Developing a comprehensive benchmark (TableBench) for evaluating table question answering in real-world scenarios.
Assessing the performance of LLMs, including GPT-4, against human-level table question answering capabilities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed TableBench, a benchmark for complex, real-world TableQA.
Introduced TableLLM, a lightweight model trained on the TableInstruct dataset.
Evaluated open-source and proprietary LLMs on industrial-grade tabular data.