TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Synthetic tabular data is often of low quality and of limited use for augmentation when the real dataset is small, imbalanced, or skewed. To address this, the paper proposes TAEGAN, a GAN-based framework that, according to the authors, is the first to bring self-supervised pre-training to tabular data generation. TAEGAN uses a masked auto-encoder (MAE) as the generator and trains it adversarially against a discriminator, aiming to improve both fidelity and downstream utility. Evaluated on 10 datasets against five existing generators, TAEGAN outperforms the other deep-learning-based methods in machine learning efficacy on 9 of them; on 8 smaller datasets, it achieves the best data augmentation performance on 7. The core contributions are the use of an MAE as the GAN generator and a self-supervised training scheme tailored to tabular data, both intended to ease the bottlenecks caused by data scarcity and distributional skew.

📝 Abstract

Synthetic tabular data generation has gained significant attention for its potential in data augmentation, software testing and privacy-preserving data sharing. However, most research has focused on larger datasets and on evaluating synthetic data quality with metrics such as column-wise statistical distributions and inter-feature correlations, while often overlooking the utility of synthetic data for data augmentation, particularly for datasets whose data is scarce. In this paper, we propose Tabular Auto-Encoder Generative Adversarial Network (TAEGAN), an improved GAN-based framework for generating high-quality tabular data. Although large language model (LLM)-based methods represent the state of the art in synthetic tabular data generation, they are often overkill for small datasets due to their extensive size and complexity. TAEGAN employs a masked auto-encoder as the generator, which for the first time brings self-supervised pre-training to tabular data generation and thereby exposes the networks to more information. We extensively evaluate TAEGAN against five state-of-the-art synthetic tabular data generation algorithms. Results from 10 datasets show that TAEGAN outperforms existing deep-learning-based tabular data generation models on 9 out of 10 datasets in machine learning efficacy and achieves superior data augmentation performance on 7 out of 8 smaller datasets.
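
The abstract names two utility measures without defining them. As commonly used, machine learning efficacy means training a model on synthetic rows and scoring it on held-out real rows, while augmentation performance compares a model trained on real plus synthetic rows with one trained on real rows alone. The sketch below illustrates both under loud assumptions: the nearest-centroid classifier, the `toy_blobs` stand-in data, and every function name are hypothetical choices for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy classifier: store one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    """Assign each row to the class with the nearest centroid."""
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def accuracy(model, X, y):
    return float((predict(model, X) == y).mean())

def ml_efficacy(X_syn, y_syn, X_test, y_test):
    """Train on synthetic rows only, score on held-out real rows."""
    return accuracy(fit_centroids(X_syn, y_syn), X_test, y_test)

def augmentation_gain(X_real, y_real, X_syn, y_syn, X_test, y_test):
    """Accuracy of (real + synthetic) training minus real-only training."""
    base = accuracy(fit_centroids(X_real, y_real), X_test, y_test)
    X_aug = np.concatenate([X_real, X_syn])
    y_aug = np.concatenate([y_real, y_syn])
    return accuracy(fit_centroids(X_aug, y_aug), X_test, y_test) - base

def toy_blobs(n_per_class, rng):
    """Hypothetical stand-in data: two well-separated Gaussian classes."""
    X = np.concatenate([rng.normal(-3.0, 1.0, (n_per_class, 2)),
                        rng.normal(3.0, 1.0, (n_per_class, 2))])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)]).astype(int)
    return X, y
```

In practice the synthetic rows would come from the generator under test and the classifier would be whatever downstream model matters; a positive `augmentation_gain` indicates that the synthetic rows helped.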

Problem

Research questions and friction points this paper is trying to address.

Generates synthetic tabular data for augmentation and privacy

Addresses GAN instability and limited information exposure in training

Handles imbalanced or skewed data distributions effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked auto-encoder as GAN generator

Introduces self-supervised warmup training for generator

Proposes novel sampling for imbalanced data and improved loss
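
The idea behind a masked auto-encoder generator with self-supervised warmup can be sketched as follows: hide a random subset of cells in each row, ask the generator to reconstruct them from the visible cells, and compute the loss only on the hidden cells. This page does not describe TAEGAN's network, loss, or sampling scheme, so everything below is an assumption made for illustration: a toy linear model in NumPy stands in for the generator, the adversarial stage is omitted, and the names `random_mask`, `masked_reconstruction_loss`, and `warmup_linear_generator` are hypothetical.

```python
import numpy as np

def random_mask(shape, mask_ratio, rng):
    """Boolean array; True marks a cell hidden from the generator."""
    return rng.random(shape) < mask_ratio

def masked_reconstruction_loss(x, x_hat, mask):
    """Mean squared error over the masked cells only."""
    n_masked = int(mask.sum())
    if n_masked == 0:
        return 0.0
    return float((((x_hat - x) ** 2) * mask).sum() / n_masked)

def warmup_linear_generator(x, mask_ratio=0.3, epochs=200, lr=0.05, seed=0):
    """Warm up a toy linear 'generator' by reconstructing masked cells.

    The input is the row with masked cells zeroed out, concatenated with
    the mask itself, so the model can tell 'hidden' from 'true zero'.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = rng.normal(scale=0.01, size=(2 * d, d))
    losses = []
    for _ in range(epochs):
        mask = random_mask(x.shape, mask_ratio, rng)  # fresh mask each epoch
        inp = np.concatenate([np.where(mask, 0.0, x), mask.astype(float)], axis=1)
        x_hat = inp @ W
        losses.append(masked_reconstruction_loss(x, x_hat, mask))
        # Gradient of the masked MSE with respect to W.
        n_masked = max(int(mask.sum()), 1)
        grad_out = 2.0 * (x_hat - x) * mask / n_masked
        W -= lr * (inp.T @ grad_out)
    return W, losses
```

On a table whose columns are correlated, the masked loss should fall as the model learns to infer hidden cells from visible ones. In a full GAN setup, a generator warmed up this way would then be trained further against a discriminator.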

🔎 Similar Papers

MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data