How well do LLMs reason over tabular data, really?

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the reasoning robustness of general-purpose large language models (LLMs) on real-world tabular data, focusing on three prevalent challenges: missing values, duplicate entities, and structural variations. To address the biases of conventional metrics (e.g., SacreBLEU, BERTScore) in tabular reasoning tasks, the authors propose an "LLM-as-a-judge" evaluation paradigm that leverages LLMs' semantic understanding for more faithful assessment. They further introduce a multidimensional tabular perturbation benchmark—incorporating missing-value injection, duplicate-entity injection, and structural deformation—to quantitatively measure performance degradation under realistic corruptions. Their empirical study reveals an average accuracy drop exceeding 40% across these perturbations, while standard metrics overestimate true performance by 2–3×. These findings provide empirical evidence and methodological foundations for designing and evaluating robust tabular reasoning models.
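The LLM-as-a-judge paradigm described above can be sketched in a few lines: a judge model is shown the question, the reference answer, and the candidate answer, and is asked for a binary verdict. The prompt wording and the `call_llm` callable below are illustrative assumptions, not the paper's exact implementation — any client function that maps a prompt string to a completion string can be plugged in.

```python
def judge_answer(question: str, reference: str, prediction: str, call_llm) -> bool:
    """Ask a judge LLM whether `prediction` semantically matches `reference`.

    `call_llm` is a hypothetical stand-in for a real API client: any function
    that takes a prompt string and returns the model's completion as a string.
    """
    prompt = (
        "You are grading answers to a question about a table.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {prediction}\n"
        "Does the candidate convey the same answer as the reference? "
        "Reply with exactly 'yes' or 'no'."
    )
    verdict = call_llm(prompt).strip().lower()
    # Treat anything that starts with "yes" as a match; everything else fails.
    return verdict.startswith("yes")
```

Accuracy is then simply the fraction of examples the judge accepts, which sidesteps the token-overlap biases of n-gram metrics on short analytical answers.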

📝 Abstract
Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions: 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBLEU and BERTScore. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' reasoning robustness on realistic tabular data variations
Evaluating LLMs' performance on analytical tabular queries accurately
Addressing shortcomings in current tabular reasoning benchmarks and metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge for reliable performance evaluation
Extending tabular inputs with realistic variations
Assessing robustness to missing values and duplicates
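The three perturbations listed above can be illustrated with a minimal sketch: given a table as a header plus rows, inject missing values, duplicate a fraction of rows, and permute the column order. The function name, parameters, and rates below are illustrative assumptions, not the paper's benchmark implementation.

```python
import random

def perturb_table(header, rows, missing_rate=0.2, duplicate_rate=0.1,
                  shuffle_columns=True, seed=0):
    """Apply three realistic corruptions to a table (list-of-lists).

    Illustrative sketch: parameter names and rates are assumptions,
    not the paper's exact benchmark settings.
    """
    rng = random.Random(seed)
    rows = [list(r) for r in rows]

    # 1) Missing-value injection: blank out cells at random.
    for r in rows:
        for i in range(len(r)):
            if rng.random() < missing_rate:
                r[i] = ""

    # 2) Duplicate-entity injection: re-append a fraction of rows.
    for r in list(rows):
        if rng.random() < duplicate_rate:
            rows.append(list(r))

    # 3) Structural variation: permute the column order consistently.
    header = list(header)
    if shuffle_columns:
        order = list(range(len(header)))
        rng.shuffle(order)
        header = [header[i] for i in order]
        rows = [[r[i] for i in order] for r in rows]

    return header, rows
```

Running a model on the clean table and on each perturbed variant, then comparing judged accuracy, gives the kind of degradation measurement the paper reports.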