AI Summary
To address the low quality of synthetic tabular data and poor generalization under class imbalance or distributional skew in few-shot settings, this paper proposes TAEGAN, a novel framework that introduces self-supervised pretraining to tabular data generation for the first time. TAEGAN employs a masked autoencoder (MAE) as the generator backbone and integrates it with a discriminative GAN architecture to jointly enhance fidelity and utility. This design balances accurate distribution modeling with downstream task adaptability. Evaluated on 10 benchmark datasets, TAEGAN outperforms existing deep generative methods in machine learning efficacy on 9 datasets; on 8 few-shot benchmarks, it achieves state-of-the-art data augmentation performance on 7. The core innovations are the MAE-GAN synergistic architecture and a self-supervised generative paradigm specifically tailored for tabular data, which significantly alleviate generation bottlenecks arising from data scarcity and distributional skew.
Abstract
Synthetic tabular data generation has gained significant attention for its potential in data augmentation, software testing, and privacy-preserving data sharing. However, most research has focused on larger datasets and has evaluated synthetic data quality with metrics such as column-wise statistical distributions and inter-feature correlations, while often overlooking its utility for data augmentation, particularly for datasets where data is scarce. In this paper, we propose the Tabular Auto-Encoder Generative Adversarial Network (TAEGAN), an improved GAN-based framework for generating high-quality tabular data. Although methods based on large language models (LLMs) represent the state of the art in synthetic tabular data generation, they are often overkill for small datasets due to their size and complexity. TAEGAN employs a masked auto-encoder as the generator, which for the first time introduces the power of self-supervised pre-training into tabular data generation and thereby exposes the networks to more information. We extensively evaluate TAEGAN against five state-of-the-art synthetic tabular data generation algorithms. Results on 10 datasets show that TAEGAN outperforms existing deep-learning-based tabular data generation models on 9 out of 10 datasets in machine learning efficacy and achieves superior data augmentation performance on 7 out of 8 smaller datasets.
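The two evaluation settings named in the abstract can be illustrated with a small sketch. Machine learning efficacy is commonly measured by training a model on synthetic data only and testing it on held-out real data, while data augmentation trains on real plus synthetic data. The abstract does not state which classifiers or metrics the authors used, so everything below is an assumption for illustration: the nearest-centroid classifier, the accuracy metric, the hypothetical `toy_rows` helper, and the `synthetic` rows, which are a stand-in for generator output and are not produced by TAEGAN.

```python
import random
from collections import defaultdict

def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(rows, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )

def ml_efficacy(train_rows, train_labels, real_test_rows, real_test_labels):
    """Accuracy on held-out real data of a model fitted on the training rows."""
    centroids = fit_centroids(train_rows, train_labels)
    hits = sum(
        predict(centroids, x) == y
        for x, y in zip(real_test_rows, real_test_labels)
    )
    return hits / len(real_test_labels)

def toy_rows(n, rng, noise):
    """Hypothetical two-class data: class 0 near (0, 0), class 1 near (3, 3)."""
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        rows.append([3.0 * y + rng.gauss(0, noise), 3.0 * y + rng.gauss(0, noise)])
        labels.append(y)
    return rows, labels

rng = random.Random(0)
real_train, real_train_y = toy_rows(20, rng, noise=0.5)   # scarce real data
real_test, real_test_y = toy_rows(200, rng, noise=0.5)    # held-out real data
synthetic, synthetic_y = toy_rows(200, rng, noise=0.6)    # stand-in for generator output

# Baseline: train on the scarce real data only.
baseline = ml_efficacy(real_train, real_train_y, real_test, real_test_y)
# Machine learning efficacy: train on synthetic only, test on real.
efficacy = ml_efficacy(synthetic, synthetic_y, real_test, real_test_y)
# Data augmentation: train on real plus synthetic, test on real.
augmented = ml_efficacy(
    real_train + synthetic, real_train_y + synthetic_y, real_test, real_test_y
)
print(f"baseline={baseline:.3f} efficacy={efficacy:.3f} augmented={augmented:.3f}")
```

In this setup a generator is useful for augmentation when the `augmented` score exceeds the `baseline` score on the same real test set; on toy data this well separated, all three scores are expected to be close to each other.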