🤖 AI Summary
This study systematically investigates the distributional discrepancies among three types of tabular pretraining corpora (web-crawled, human-curated, and synthetic) and their impact on downstream performance. Using table-level, column-level, and correlation features, the authors compare the corpora with discriminator AUC and k-NN coverage, and relate distributional proximity to performance using both these features and TabICL's internal representations. They find that the synthetic prior (TabICL) occupies a narrow region of the real-world tabular distribution space, and that this gap cannot be closed by optimizing the prior's hyperparameters across more than 86 thousand configurations. In contrast, human-curated and web-crawled data are broadly interchangeable at the distributional level. Despite this pronounced divergence, the gap has no clearly detectable effect on downstream performance, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalization.
📝 Abstract
Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, or about the impact this has on downstream performance. In this work we take three canonical datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset represents curated tables from Kaggle, and the TabICL dataset represents synthetic priors, being the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
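To make the two comparison metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes each table has already been reduced to a numeric feature vector. It uses one common definition of k-NN coverage (the fraction of real points whose k-nearest-neighbour ball contains at least one synthetic point) and computes a discriminator AUC from held-out scores by pairwise comparison. The function names, the toy Gaussian data, and the use of the vector norm as a stand-in discriminator score are all illustrative assumptions; the paper's exact features, classifier, and metric variants may differ.

```python
import math
import random


def knn_coverage(real, synth, k=3):
    """Fraction of real points whose k-NN ball (radius = distance to the
    k-th nearest *other* real point) contains at least one synthetic point."""
    covered = 0
    for i, r in enumerate(real):
        # Distances from r to every other real point, ascending.
        dists = sorted(math.dist(r, q) for j, q in enumerate(real) if j != i)
        radius = dists[k - 1]
        if any(math.dist(r, s) <= radius for s in synth):
            covered += 1
    return covered / len(real)


def pairwise_auc(scores_a, scores_b):
    """AUC from discriminator scores: probability that a random item from
    corpus A scores higher than a random item from corpus B (ties count 0.5)."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


rng = random.Random(0)
# Toy 2-D "feature vectors" (hypothetical): a broad real cloud, a narrow
# synthetic cloud, and a second broad cloud drawn like the real one.
real = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(200)]
narrow = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
broad = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(200)]

cov_narrow = knn_coverage(real, narrow)
cov_broad = knn_coverage(real, broad)

# Stand-in discriminator score (assumption): distance from the origin.
# A trained classifier's held-out scores would be used in practice.
auc_narrow = pairwise_auc([math.hypot(*p) for p in real],
                          [math.hypot(*p) for p in narrow])
auc_broad = pairwise_auc([math.hypot(*p) for p in real],
                         [math.hypot(*p) for p in broad])

print(f"coverage  narrow={cov_narrow:.2f}  broad={cov_broad:.2f}")
print(f"AUC       narrow={auc_narrow:.2f}  broad={auc_broad:.2f}")
```

In this toy setting, a corpus confined to a narrow region should yield low coverage of the real points and an AUC far from 0.5, while a corpus drawn like the real one should yield higher coverage and an AUC near 0.5. This mirrors the kind of contrast the abstract reports between the synthetic prior and the two real corpora.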