🤖 AI Summary
This work challenges the prevailing assumption that Table Foundation Models (TFMs) require large-scale synthetic or real-world pretraining data to achieve generalization. Method: We propose a lightweight self-supervised pretraining framework that learns structured semantics from a single real-world table, combined with in-context learning for zero-shot cross-domain transfer, without external corpora or additional annotations. Contribution/Results: We demonstrate that the quality and diversity of task construction, rather than data scale, are the primary determinants of TFM performance. Evaluated across heterogeneous downstream benchmarks spanning finance, healthcare, and e-commerce, our approach significantly outperforms existing few-shot baselines. These results validate the effectiveness and scalability of the "single-table pretraining + in-context learning" paradigm, establishing a novel, resource-efficient framework for tabular modeling in low-data regimes.
📝 Abstract
Deep tabular modeling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), and find instead that a relatively small amount of data suffices for generalization. We show that simple self-supervised pre-training on just a *single* real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze which aspects of the data matter most for building a Tabular Foundation Model (TFM) that generalizes across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of *tasks* one can construct from a dataset is key to downstream performance.
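The abstract describes constructing many in-context tasks from a single table: each task picks a column to serve as the label and samples disjoint context and query rows, and the model predicts query labels from the $(x,y)$ context pairs without weight updates. The paper's actual pipeline is not shown here; the sketch below is a minimal, hypothetical illustration of that task-construction idea (function name `make_icl_task` and all parameters are assumptions, not the authors' API).

```python
import numpy as np

def make_icl_task(table, target_col, n_context, n_query, rng):
    """Hypothetical sketch: build one in-context learning task from a single
    numeric table by treating one column as the label and sampling disjoint
    context and query rows. Returns ((ctx_X, ctx_y), (qry_X, qry_y))."""
    n_rows, n_cols = table.shape
    # Sample disjoint row indices for context and query.
    rows = rng.choice(n_rows, size=n_context + n_query, replace=False)
    # All remaining columns serve as features for this task.
    feat_cols = [c for c in range(n_cols) if c != target_col]
    X = table[rows][:, feat_cols]
    y = table[rows][:, target_col]
    return (X[:n_context], y[:n_context]), (X[n_context:], y[n_context:])

# Many distinct tasks arise from one table by varying target_col and the
# sampled rows -- the "number and quality of tasks" the abstract refers to.
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 6))  # toy single table: 100 rows, 6 columns
(ctx_X, ctx_y), (qry_X, qry_y) = make_icl_task(
    table, target_col=2, n_context=32, n_query=8, rng=rng
)
```

A pre-training loop would then repeatedly draw such tasks and train the model to predict `qry_y` given the context pairs, which is the shared TFM procedure the abstract refers to.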