Representation Learning for Tabular Data: A Comprehensive Survey

📅 2025-04-17
🤖 AI Summary
This survey addresses three fundamental challenges in tabular representation learning (heterogeneous feature coupling, sample sparsity, and weak task generalization) that hinder effective modeling of structured tabular information with deep neural networks. It organizes existing methods into a three-tiered framework of *specialized*, *transferable*, and *general-purpose* models, and introduces a hierarchical taxonomy that categorizes specialized models along three dimensions: features, samples, and prediction objectives. It formally defines and classifies transferable models and tabular foundation models, and unifies recent advances in multimodal alignment, open-world adaptation, and self-supervised pretraining. The result is a structured, extensible landscape of tabular representation learning, accompanied by an open-source repository (GitHub) containing curated resources, benchmarks, and implementation guidelines. This synthesis provides both theoretical foundations and practical blueprints for algorithm design and industrial deployment.

📝 Abstract
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data -- features, samples, and objectives -- and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: https://github.com/LAMDA-Tabular/Tabular-Survey.
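The three-way split described in the abstract can be read as a simple decision rule. The snippet below is only an illustrative sketch of that reading, not code from the survey or its repository; the names `TabularModelCategory`, `categorize`, `pretrained_on_other_data`, and `needs_fine_tuning` are hypothetical and chosen here for illustration.

```python
from enum import Enum


class TabularModelCategory(Enum):
    """Hypothetical labels for the three categories named in the abstract."""
    SPECIALIZED = "specialized"
    TRANSFERABLE = "transferable"
    GENERAL = "general"


def categorize(pretrained_on_other_data: bool, needs_fine_tuning: bool) -> TabularModelCategory:
    """Assign a category from two yes/no properties of a model (an assumed simplification)."""
    # Specialized: trained and evaluated within the same data distribution,
    # with no pre-training on other datasets.
    if not pretrained_on_other_data:
        return TabularModelCategory.SPECIALIZED
    # Transferable: pre-trained elsewhere, then fine-tuned on the downstream task.
    if needs_fine_tuning:
        return TabularModelCategory.TRANSFERABLE
    # General (tabular foundation model): pre-trained and applied directly,
    # without fine-tuning.
    return TabularModelCategory.GENERAL
```

For example, `categorize(True, False)` yields `TabularModelCategory.GENERAL` under this simplified rule; real models may not fit two boolean properties so cleanly.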
Problem

Research questions and friction points this paper is trying to address.

Surveying representation learning methods for tabular data
Comparing DNN-based models for tabular data classification
Organizing models by generalization: specialized, transferable, general
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Neural Networks for tabular representation learning
Hierarchical taxonomy for specialized tabular models
General tabular models without fine-tuning
Jun-Peng Jiang
Ph.D. student at Nanjing University
Tabular Data Learning, Multimodal Learning, MLLMs
Si-Yang Liu
Nanjing University
Machine Learning, Tabular Data, LLMs
Hao-Run Cai
School of Artificial Intelligence, Nanjing University, and National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Qile Zhou
School of Artificial Intelligence, Nanjing University, and National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Han-Jia Ye
Nanjing University
Machine Learning, Data Mining, Metric Learning, Meta-Learning