CTSyn: A Foundational Model for Cross Tabular Data Generation

📅 2024-06-07
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing generative foundation models face key bottlenecks in tabular data synthesis, including difficulty modeling heterogeneous features and the absence of cross-table joint generation. This paper introduces CTSyn—the first generative foundation model explicitly designed for cross-table scenarios. Its core contributions are: (1) a cross-table unified latent-space aggregation mechanism enabling collaborative representation learning across multi-source, heterogeneous tables; (2) a conditional latent-variable diffusion sampling framework supporting controllable, high-fidelity generation; and (3) a type-aware decoder that accurately reconstructs diverse field types—including numerical, categorical, and temporal attributes. Evaluated on multiple real-world datasets, CTSyn-generated data improves downstream task performance by 3.2–7.8% on average, surpassing real-data baselines for the first time—marking a significant advance in both utility and diversity of synthetic tabular data.

📝 Abstract
Generative Foundation Models (GFMs) have produced synthetic data with remarkable quality in modalities such as images and text. However, applying GFMs to tabular data poses significant challenges due to the inherent heterogeneity of table features. Existing cross-table learning frameworks are hindered by the absence of both a generative model backbone and a decoding mechanism for heterogeneous feature values. To overcome these limitations, we introduce the Cross-Table Synthesizer (CTSyn), a diffusion-based foundational model tailored for tabular data generation. CTSyn introduces three major components: an aggregator that consolidates heterogeneous tables into a unified latent space; a conditional latent diffusion model for sampling from this space; and type-specific decoders that reconstruct values of varied data types from sampled latent vectors. Extensive testing on real-world datasets reveals that CTSyn not only significantly outperforms existing table synthesizers in utility and diversity, but also uniquely enhances performances of downstream machine learning beyond what is achievable with real data, thus establishing a new paradigm for synthetic data generation.
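The last stage described in the abstract, type-specific decoders that turn a sampled latent vector back into typed column values, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the schema, the latent size, the use of one latent vector per row, and the random weights standing in for trained decoder parameters are all hypothetical.

```python
import random

# Hypothetical schema: column name -> (type, category labels or None).
SCHEMA = {
    "age": ("numerical", None),
    "income": ("numerical", None),
    "occupation": ("categorical", ["clerk", "engineer", "teacher"]),
}
LATENT_DIM = 4  # assumed latent size, chosen only for the example


def make_weights(rows, cols, rng):
    """Random matrix standing in for trained decoder parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def build_decoders(schema, rng):
    """One linear head per column: 1 output for numerical, one score per label for categorical."""
    decoders = {}
    for name, (kind, labels) in schema.items():
        out_dim = 1 if kind == "numerical" else len(labels)
        decoders[name] = make_weights(out_dim, LATENT_DIM, rng)
    return decoders


def decode_row(latent, schema, decoders):
    """Map one latent vector to a row of typed values."""
    row = {}
    for name, (kind, labels) in schema.items():
        scores = matvec(decoders[name], latent)
        if kind == "numerical":
            row[name] = scores[0]  # regression-style output
        else:
            row[name] = labels[scores.index(max(scores))]  # highest-scoring label
    return row


rng = random.Random(0)
decoders = build_decoders(SCHEMA, rng)
# Stand-in for a vector sampled by the latent diffusion model.
latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
synthetic_row = decode_row(latent, SCHEMA, decoders)
print(synthetic_row)
```

The point of the sketch is the dispatch on column type: numerical columns read a scalar off the latent, while categorical columns pick the highest-scoring label, so a single latent space can serve columns of different types.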

Problem

Research questions and friction points this paper is trying to address.

Generating synthetic tabular data with a diffusion-based generative foundation model
Handling heterogeneous table features through a unified encoding
Improving utility and diversity over existing table synthesizers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based generative model for tabular data
Autoencoder that unifies heterogeneous tables into a shared latent space
Latent diffusion model conditioned on table schema
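
To make the latent diffusion idea concrete, the sketch below shows the forward (noising) step on a latent vector. It assumes a standard DDPM-style linear noise schedule; the source text does not specify which schedule or parameterization CTSyn uses, and the function names and constants are hypothetical. The noised latent is z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps, where a_t is the cumulative product of (1 - beta).

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """a_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t given z_0 in closed form; also return the noise that was used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
z0 = [0.5, -1.2, 0.3, 2.0]  # hypothetical latent vector for one table row
zt, eps = add_noise(z0, 500, alpha_bars, rng)
print(zt)
```

In a conditional setup, a denoising network would be trained to predict `eps` from `zt`, the step `t`, and a conditioning input such as an embedding of the table schema; that network is omitted here.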