TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models

📅 2025-06-23
🤖 AI Summary
Existing tabular reasoning benchmarks lack fair, fine-grained evaluation of both shallow comprehension and deep reasoning capabilities. To address this, we propose TReB, a comprehensive benchmark covering 26 diverse subtasks, together with a multidimensional evaluation framework. We design three reasoning paradigms: TCoT (Tabular Chain-of-Thought), PoT (Tabular Program-of-Thought), and ICoT (Interactive Chain-of-Thought), enabling robust analysis under realistic scenarios. Leveraging iterative data cleaning and high-quality prompt engineering, we construct a tabular dataset with a high signal-to-noise ratio. Systematic evaluation of over 20 state-of-the-art large language models reveals persistent deficiencies in complex tabular reasoning. The TReB dataset and evaluation framework are publicly released, providing a reproducible, extensible, and standardized toolkit for advancing tabular intelligence research.

📝 Abstract
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs across broad table reasoning abilities. In this paper, we fill this gap by presenting a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities across a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and demonstrate its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].
Problem

Research questions and friction points this paper is trying to address.

Lack of an effective benchmark for evaluating LLMs' table reasoning abilities
Challenges in reasoning over complex, structured table data
Need for comprehensive evaluation of both shallow and deep table reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive table reasoning benchmark, TReB
High-quality dataset built via iterative data processing
Three inference modes: TCoT, PoT, ICoT
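The three inference modes differ mainly in how the model is prompted over the serialized table: TCoT asks for step-by-step natural-language reasoning, PoT asks for an executable program, and ICoT allows multi-turn interaction with code execution. A minimal sketch of this dispatch (function names, serialization format, and prompt wording are illustrative assumptions, not TReB's actual templates):

```python
def serialize_table(header, rows):
    """Render a table as pipe-separated text for inclusion in a prompt."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_prompt(mode, header, rows, question):
    """Build a prompt for one of the three inference modes (hypothetical templates)."""
    table_text = serialize_table(header, rows)
    base = f"Table:\n{table_text}\nQuestion: {question}\n"
    if mode == "TCoT":   # reason step by step in natural language
        return base + "Think step by step, then give the final answer."
    if mode == "PoT":    # emit an executable program over the table
        return base + "Write Python code that computes the answer."
    if mode == "ICoT":   # multi-turn: alternate reasoning, code actions, observations
        return base + "You may run code and observe results before answering."
    raise ValueError(f"unknown inference mode: {mode}")

# Example usage with a toy table
header = ["city", "population"]
rows = [["Beijing", 21893095], ["Shanghai", 24870895]]
prompt = build_prompt("PoT", header, rows, "Which city is larger?")
```

In a real evaluation loop, the returned prompt would be sent to each LLM under test and the response scored per sub-task; for ICoT the exchange would iterate until the model commits to an answer.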
Authors

Ce Li (CUMTB)
Xiaofan Liu (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Zhiyan Song (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Ce Chi (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Chen Zhao (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Jingjing Yang (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Zhendong Wang (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Kexin Yang (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Boshen Shi (China Mobile JIUTIAN Artificial Intelligence Research Institute)
Xing Wang (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Chao Deng (JIUTIAN Team, China Mobile Research Institute, Beijing, China)
Junlan Feng (Chief Scientist at China Mobile Research)