🤖 AI Summary
This study reveals that large language models' (LLMs) strong performance on public tabular datasets often stems from data contamination, i.e., memorization of semantic cues (e.g., column names, value distributions) encountered during training, rather than genuine reasoning. To isolate memorization from reasoning, we design controlled probing experiments employing semantic denoising and column-name randomization. Our results show that removing such semantic cues collapses model accuracy to chance level, exposing a severe overestimation of generalization capability. We introduce the concept of *semantic contamination bias* and propose a principled evaluation paradigm that explicitly disentangles memorization from reasoning. This framework establishes a methodological foundation and empirical basis for rigorous LLM assessment on structured data.
📝 Abstract
Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
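The column-name randomization probe described above can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' exact protocol: it replaces semantic column names with neutral placeholders (`col_0`, `col_1`, ...) and permutes column order to break positional cues, so that any remaining accuracy must come from the values themselves rather than memorized schema semantics. The function name and seeding scheme are assumptions for illustration.

```python
import random

def randomize_columns(header, rows, seed=0):
    """Strip semantic cues from a small tabular dataset.

    Hypothetical sketch of a column-name randomization probe:
    - rename every column to a neutral placeholder, and
    - permute the column order with a fixed seed for reproducibility.
    Returns (new_header, new_rows) with values preserved.
    """
    rng = random.Random(seed)
    n = len(header)
    perm = list(range(n))
    rng.shuffle(perm)  # deterministic permutation of column positions
    new_header = [f"col_{i}" for i in range(n)]
    new_rows = [[row[j] for j in perm] for row in rows]
    return new_header, new_rows

# Example: an Adult-Income-style row loses its recognizable schema.
header = ["age", "education", "income"]
rows = [[39, "Bachelors", ">50K"], [50, "HS-grad", "<=50K"]]
anon_header, anon_rows = randomize_columns(header, rows)
```

A model prompted with `anon_header` and `anon_rows` can no longer match the table to a memorized benchmark by its column names; comparing its accuracy before and after this transformation is one way to separate semantic leakage from value-level reasoning.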