MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

📅 2025-06-05
🤖 AI Summary
Existing table understanding benchmarks focus narrowly on tasks like NL-to-SQL and Table-QA, failing to capture the multidimensional requirements of expert users in real-world scenarios. Method: We introduce MMTU, a large-scale, expert-oriented, multi-task benchmark for table understanding, comprising 25 real-world table tasks (e.g., database querying, spreadsheet manipulation, computational notebook analysis) and over 30,000 questions. MMTU systematically integrates decades of table research, defining a unified evaluation paradigm spanning semantic comprehension, logical reasoning, and code generation. Contribution/Results: Experiments reveal that state-of-the-art models, including o4-mini and DeepSeek R1, achieve only ~60% average accuracy, exposing critical limitations in large language models' ability to perform deep, structured data processing. MMTU thus establishes a rigorous, diagnostic standard for evaluating and advancing table understanding.

📝 Abstract
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and narrowly focused on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason about, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remains challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive benchmarks for table understanding tasks
Limited evaluation of table-related skills in real-world applications
Challenges in expert-level table reasoning and manipulation for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark with over 30K questions
Evaluation across 25 real-world table tasks
Combines table understanding, reasoning, and coding
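To make the evaluation setup concrete, the sketch below computes per-task exact-match accuracy over a set of question records, the kind of aggregation a multi-task benchmark like MMTU reports. This is an illustrative assumption, not the official MMTU harness: the record fields (`task`, `answer`, `prediction`) and the normalization rule are hypothetical.

```python
# Hypothetical per-task exact-match scoring sketch. Field names and the
# normalization rule are illustrative assumptions, not MMTU's official
# evaluation code.
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())

def score_by_task(records):
    """Return {task_name: accuracy} from dicts with task/answer/prediction."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        if normalize(r["prediction"]) == normalize(r["answer"]):
            hits[r["task"]] += 1
    return {t: hits[t] / totals[t] for t in totals}

# Toy usage with made-up records:
records = [
    {"task": "nl-to-sql", "answer": "SELECT 1", "prediction": "select 1"},
    {"task": "nl-to-sql", "answer": "SELECT 2", "prediction": "SELECT 3"},
    {"task": "table-qa", "answer": "42", "prediction": "42"},
]
print(score_by_task(records))  # {'nl-to-sql': 0.5, 'table-qa': 1.0}
```

Averaging the per-task accuracies (rather than pooling all questions) keeps small tasks from being drowned out by large ones, which matters when task sizes vary widely.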