🤖 AI Summary
Deep learning for tabular data remains challenging because of irregular schema structures and heterogeneous, sparse information distributions. Method: This paper systematically surveys 127 papers published since 2020 in top-tier conferences and journals, and proposes a unified analytical framework for tabular representation learning grounded in three pillars: training data, neural architecture, and learning objective. Contribution/Results: It introduces the first holistic “triadic” analysis paradigm, emphasizing cross-task generalizability and robustness. The framework covers key advances, including data augmentation, specialized architectures (e.g., FT-Transformer), self-supervised pretraining, contrastive learning, and multi-task optimization. It identifies critical research gaps, distills fundamental challenges and evolutionary trends, and establishes standardized evaluation dimensions. Together, these results provide both theoretical foundations and practical guidelines for developing general-purpose, robust, and interpretable deep learning methods for tabular data.
📝 Abstract
Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation learning, structured around three foundational design elements: training data, neural architectures, and learning objectives. Unlike prior surveys that focus primarily on either architecture design or learning strategies, we adopt a holistic perspective that emphasizes the universality and robustness of representation learning methods across diverse downstream tasks. We examine recent advances in data augmentation and generation, specialized neural network architectures tailored to tabular data, and innovative learning objectives that enhance representation quality. Additionally, we highlight the growing influence of self-supervised learning and the adaptation of transformer-based foundation models for tabular data. Our review is based on a systematic literature search using rigorous inclusion criteria, encompassing 127 papers published since 2020 in top-tier conferences and journals. Through detailed analysis and comparison, we identify emerging trends, critical gaps, and promising directions for future research, aiming to guide the development of more generalizable and effective tabular data representation methods.