🤖 AI Summary
This work challenges the prevailing assumption that Table Foundation Models (TFMs) require large-scale synthetic or real-world pretraining data to achieve generalization. Method: We propose a lightweight self-supervised pretraining framework that learns structured semantics from a single real-world table, combined with in-context learning for zero-shot cross-domain transfer, without external corpora or additional annotations. Contribution/Results: We demonstrate that the quality and diversity of task construction, rather than data scale, are the primary determinants of TFM performance. Evaluated across heterogeneous downstream benchmarks spanning finance, healthcare, and e-commerce, our approach significantly outperforms existing few-shot baselines. These results validate the effectiveness and scalability of the "single-table pretraining + in-context learning" paradigm, establishing a novel, resource-efficient framework for tabular modeling in low-data regimes.
📝 Abstract
Deep tabular modeling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), and find instead that a relatively small amount of data suffices for generalization. We show that simple self-supervised pre-training on just a *single* real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze which aspects of the data matter most for building a Tabular Foundation Model (TFM) that generalizes across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of *tasks* one can construct from a dataset is key to downstream performance.
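The abstract describes constructing many in-context tasks from a single table: each task picks a column to serve as the label and samples disjoint context and query rows, and the model predicts query labels from the $(x,y)$ context pairs without weight updates. The paper's actual pipeline is not shown here; the sketch below is a minimal, hypothetical illustration of that task-construction idea (function name `make_icl_task` and all parameters are assumptions, not the authors' API).

```python
import numpy as np

def make_icl_task(table, target_col, n_context, n_query, rng):
    """Hypothetical sketch: build one in-context learning task from a single
    numeric table by treating one column as the label and sampling disjoint
    context and query rows. Returns ((ctx_X, ctx_y), (qry_X, qry_y))."""
    n_rows, n_cols = table.shape
    # Sample disjoint row indices for context and query.
    rows = rng.choice(n_rows, size=n_context + n_query, replace=False)
    # All remaining columns serve as features for this task.
    feat_cols = [c for c in range(n_cols) if c != target_col]
    X = table[rows][:, feat_cols]
    y = table[rows][:, target_col]
    return (X[:n_context], y[:n_context]), (X[n_context:], y[n_context:])

# Many distinct tasks arise from one table by varying target_col and the
# sampled rows -- the "number and quality of tasks" the abstract refers to.
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 6))  # toy single table: 100 rows, 6 columns
(ctx_X, ctx_y), (qry_X, qry_y) = make_icl_task(
    table, target_col=2, n_context=32, n_query=8, rng=rng
)
```

A pre-training loop would then repeatedly draw such tasks and train the model to predict `qry_y` given the context pairs, which is the shared TFM procedure the abstract refers to.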