🤖 AI Summary
Problem: Deep learning models represent and generalize poorly on tabular learning tasks that involve free-text fields, largely because they rely on static, target-agnostic text representations.
Method: We propose a target-aware, text-enhanced tabular learning framework. It unfreezes a pretrained text encoder and feeds it target tokens alongside the table's features, so that the resulting text embeddings adapt to the task at hand. The architecture has no dataset-specific parameters, which allows end-to-end pretraining across multiple datasets and transfer learning to new ones.
Contribution/Results: Our method achieves state-of-the-art performance on medium- and large-sized datasets across known benchmarks of classification tasks with text features. Its pretraining phase also exhibits scaling laws in the number of datasets: performance improves as more datasets are added, which points to a pathway for further gains.
📝 Abstract
While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
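To make the idea of target tokens concrete, here is a minimal sketch of how one table row and its candidate classes might be turned into text elements for a shared text encoder. It is an illustration under stated assumptions, not TabSTAR's actual implementation: the function names `verbalize` and `build_model_input`, the `"kind: name = value"` template, and the example row are all hypothetical.

```python
from typing import Dict, List, Sequence


def verbalize(kind: str, name: str, value: object) -> str:
    """Render one table cell or one candidate class as plain text.

    The template is a hypothetical choice made for this sketch.
    """
    return f"{kind}: {name} = {value}"


def build_model_input(
    row: Dict[str, object],
    target_name: str,
    classes: Sequence[str],
) -> List[str]:
    """Build the list of text elements for a single example.

    One target token is produced per candidate class, followed by one
    element per feature. Nothing here depends on a particular dataset:
    the number of elements simply follows the number of classes and columns.
    """
    target_tokens = [verbalize("target", target_name, c) for c in classes]
    feature_tokens = [verbalize("feature", col, val) for col, val in row.items()]
    return target_tokens + feature_tokens


# Hypothetical example row with one free-text column and one numeric column.
row = {"review": "Battery died after two days", "price": 19.99}
tokens = build_model_input(row, "sentiment", ["negative", "positive"])
for t in tokens:
    print(t)
```

In a setup like this, every element would pass through the same unfrozen text encoder, so the embedding of the `review` text can be shaped by what the target is. Because the target and features are described in text rather than tied to dataset-specific output heads, the same parameters could in principle be reused across datasets with different columns and label sets.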