🤖 AI Summary
Existing deep learning methods for tabular data employ separate, high-dimensional embeddings for numerical and categorical features, which hinders pre-trained knowledge transfer and cross-table generalization. This work proposes Tab2Text, a paradigm that systematically converts raw tabular data into natural language sequences and leverages large language models (LLMs) to produce unified, semantically rich embeddings that integrate seamlessly into mainstream architectures, including MLPs, ResNets, and FT-Transformers. To our knowledge, this is the first approach to directly incorporate LLM-derived pre-trained representations into end-to-end tabular deep learning, overcoming the limitations of traditional feature-wise encoding. By enabling holistic semantic modeling and plug-and-play enhancement, Tab2Text achieves state-of-the-art performance across seven standard classification benchmarks, consistently outperforming strong baselines such as MLP, ResNet, and FT-Transformer, with significant gains in predictive accuracy.
📝 Abstract
Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods address the heterogeneous nature of tabular data by employing separate, type-specific encoding approaches, which limits cross-table transfer and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution for improving deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet, and FT-Transformer, on seven classification datasets.
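The first step of the pipeline described above, serializing a tabular row into a natural-language sequence, can be sketched as follows. The exact template used by Tab2Text is not given in this summary, so the "column is value" phrasing and the downstream encoder call in the comments are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the tabular-to-text serialization step (assumed template).

def row_to_text(row: dict) -> str:
    """Serialize one tabular row (column -> value) into a sentence-like string.

    The "col is val" phrasing is a hypothetical template; the paper's actual
    serialization format may differ.
    """
    parts = [f"{col} is {val}" for col, val in row.items()]
    return ". ".join(parts) + "."

# Example row from a census-style classification table.
row = {"age": 39, "workclass": "State-gov", "education": "Bachelors"}
text = row_to_text(row)
print(text)  # age is 39. workclass is State-gov. education is Bachelors.

# In the full pipeline, `text` would be passed to a frozen pre-trained LLM
# encoder to obtain a unified embedding, which then feeds a standard head
# such as an MLP, ResNet, or FT-Transformer.
```

Note that both numerical and categorical features pass through the same textual channel, which is what removes the need for separate type-specific encoders.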